<div>== Presented by == <br />
Casey De Vera, Solaiman Jawad<br />
<br />
== Introduction == <br />
Boosting is an important and fairly standard technique in classification that combines several base learners (learners with “low accuracy”) into one boosted learner (a learner with “high accuracy”). Pioneered by the AdaBoost approach of Freund & Schapire, recent decades have seen extensive work on boosting procedures and analyses of their limitations.<br />
<br />
In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:<br />
<br />
Let <math>T</math> denote the number of iterations and <math>t</math> the iteration index. <br />
<br />
for <math> t= 1, \cdots, T </math> do the following:<br />
<br />
::1. Train a learner <math> \mu_t</math> from a given class of base learners on the data distribution <math> \mathcal D_t</math><br />
<br />
::2. Evaluate the performance of <math> \mu_t</math> by computing its loss.<br />
<br />
::3. Shift the weight of the data distribution <math> \mathcal D_t</math> towards the misclassified examples, producing <math> \mathcal D_{t+1}</math>; a reasonably good model will place higher weight on those misclassified examples.<br />
<br />
Finally, the learners are combined with some form of voting (e.g., soft or hard voting, averaging, thresholding).<br />
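The template above can be made concrete with a minimal AdaBoost-style sketch (for illustration only: the stumps, data, and parameter choices below are made up, and this is not the IPBoost method):

```python
import math

def stump(threshold, sign):
    """Base learner: predict `sign` if x > threshold, else -sign."""
    return lambda x: sign if x > threshold else -sign

def boost(points, labels, candidates, T):
    """AdaBoost-style loop: reweight the distribution D_t each round."""
    N = len(points)
    D = [1.0 / N] * N                      # data distribution D_t
    ensemble = []                          # (weight, learner) pairs
    for _ in range(T):
        # 1. select the base learner with the lowest weighted loss on D_t
        errs = [sum(D[i] for i in range(N) if h(points[i]) != labels[i])
                for h in candidates]
        err, h = min(zip(errs, candidates), key=lambda p: p[0])
        if err >= 0.5:                     # 2. evaluate its loss
            break
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
        ensemble.append((alpha, h))
        # 3. push weight of D_t towards misclassified examples -> D_{t+1}
        D = [D[i] * math.exp(-alpha * labels[i] * h(points[i]))
             for i in range(N)]
        s = sum(D)
        D = [d / s for d in D]
    # finally, combine the learners with a weighted (soft) vote
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

points = [0.1, 0.4, 0.6, 0.9]
labels = [-1, -1, 1, 1]
candidates = [stump(t, s) for t in (0.25, 0.5, 0.75) for s in (1, -1)]
f = boost(points, labels, candidates, T=5)
print([f(x) for x in points])  # -> [-1, -1, 1, 1]
```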
<br />
<br />
[[File:boosting.gif|200px|thumb|right]] A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function using coordinate gradient descent. Boosting schemes of this type are often referred to as convex potential boosters. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. However, they can be derailed by a small number of incorrect labels, which can be quite challenging to fix: the procedure keeps shifting weight onto the mislabeled examples, which results in bad performance on unseen data (shown in the zoomed example to the right). In fact, in theory, provided the class of base learners is rich enough, a perfect strong learner with training accuracy 1 can be constructed; clearly, such a learner need not generalize well. Boosted learners can generate decision boundaries far more complicated than those of the base learners. Here is an example from Paul van der Laken’s blog (extreme gradient boosting gif by Ryan Holbrook): data is generated online according to some process whose optimal decision boundary is the dotted line, and XGBoost was used to learn a classifier:<br />
<br />
<br />
Recently non-convex optimization approaches for solving machine learning problems have gained significant attention. In this paper, the non-convex boosting in classification using integer programming has been explored, and the real-world practicability of the approach while circumventing shortcomings of convex boosting approaches is also shown. The paper reports results that are comparable to or better than current state-of-the-art approaches.<br />
<br />
== Motivation ==<br />
<br />
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would still like to construct strong learners for such data, but noisy labels are an issue: when a model is trained on data with an excessive amount of noisy labels, its performance and accuracy deteriorate significantly. Revisiting the general boosting template above, we might suspect trouble as soon as a certain fraction of training examples is mislabeled: the model cannot correctly classify these examples, so the procedure shifts more and more weight towards them. This eventually leads to a strong learner that accurately predicts the (flawed) training data but no longer generalizes well. This intuition was formalized by [LS], who construct a "hard" training data distribution in which a small percentage of labels is randomly flipped. This label noise leads to a significant reduction in the boosted learners' performance; see the tables below. The more technical reason for this problem is the convexity of the loss function minimized by the boosting procedure. One can apply all kinds of "tricks," such as early stopping, but at the end of the day this does not solve the fundamental problem.<br />
<br />
== IPBoost: Boosting via Integer Programming ==<br />
<br />
<br />
===Integer Program Formulation===<br />
Let <math>(x_1,y_1),\cdots, (x_N,y_N) </math> be the training set, with points <math>x_i \in \mathbb{R}^d</math> and two-class labels <math>y_i \in \{\pm 1\}</math>. Assume further that we are given:<br />
* a class of base learners <math> \Omega :=\{h_1, \cdots, h_L: \mathbb{R}^d \rightarrow \{\pm 1\}\} </math> and a margin <math>\rho \ge 0</math>, <br />
* an error function <math> \eta </math>.<br />
Our boosting model is captured by the following integer program, which we call the primal problem: <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j=1}^L \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i=1,\cdots, N \\ <br />
&\sum_{j=1}^L \lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
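For intuition, the primal can be checked by brute force on a tiny made-up instance (toy numbers, not from the paper; real instances require an IP solver with column generation, as described below):

```python
from itertools import product

def feasible(z, eta, rho, steps=100):
    """Check whether some lambda on the 1-simplex satisfies all margin
    constraints  sum_j eta_ij*lambda_j + (1+rho)*z_i >= rho.
    With L = 2 learners a grid over lambda_1 suffices for this sketch."""
    for k in range(steps + 1):
        lam = [k / steps, 1 - k / steps]
        if all(sum(eta[i][j] * lam[j] for j in range(2))
               + (1 + rho) * z[i] >= rho for i in range(len(eta))):
            return True
    return False

def solve(eta, rho):
    """Enumerate z in {0,1}^N and keep the feasible z minimising sum(z)."""
    best = None
    for z in product((0, 1), repeat=len(eta)):
        if feasible(z, eta, rho) and (best is None or sum(z) < sum(best)):
            best = z
    return best

eta = [[1, -1], [-1, 1], [1, 1]]   # toy margins eta_ij (N=3, L=2)
print(solve(eta, rho=0.05))        # -> (0, 1, 0): one point misclassified
```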
<br />
For the error function <math>\eta</math>, three options were considered:<br />
<br />
(i) <math> \pm 1 </math> classification from learners<br />
$$ \eta_{ij} := 2\mathbb{I}[h_j(x_i) = y_i] - 1 = y_i \cdot h_j(x_i) $$<br />
(ii) class probabilities of learners<br />
$$ \eta_{ij} := 2\mathbb{P}[h_j(x_i) = y_i] - 1$$<br />
(iii) SAMME.R error function for learners<br />
$$ \eta_{ij} := \frac{1}{2}y_i\log\left(\frac{\mathbb{P}[h_j(x_i) = 1]}{\mathbb{P}[h_j(x_i) = -1]}\right)$$<br />
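As a small illustration of options (i) and (ii), the entries <math>\eta_{ij}</math> can be computed directly from learner outputs (the learner predictions and probabilities below are toy values, not from the paper):

```python
def eta_pm1(h_pred, y):
    """(i) +/-1 classification: eta_ij = y_i * h_j(x_i)."""
    return [[yi * hj for hj in row] for yi, row in zip(y, h_pred)]

def eta_prob(p_correct):
    """(ii) class probabilities: eta_ij = 2*P[h_j(x_i) = y_i] - 1."""
    return [[2 * p - 1 for p in row] for row in p_correct]

y = [1, -1]                     # labels y_i
h_pred = [[1, -1], [-1, -1]]    # h_j(x_i): row i = point, column j = learner
print(eta_pm1(h_pred, y))       # -> [[1, -1], [1, 1]]
print(eta_prob([[0.9, 0.5]]))   # a confident and an uninformative learner
```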
<br />
===Solution of the IP using Column Generation===<br />
<br />
The goal of column generation is to provide an efficient way to solve the primal's linear programming (LP) relaxation, obtained by allowing the <math>z_i</math> variables to take fractional values. Most columns, i.e., base learners <math> \mathcal L \subseteq [L]</math>, are left out because there are too many to handle efficiently, and most of them will have their associated weight equal to zero in the optimal solution anyway. Columns are generated within a branch-and-bound framework, leading effectively to a branch-and-price algorithm; this is significantly more involved than column generation for a plain linear program. To check the optimality of an LP solution, a subproblem called the pricing problem is solved to identify columns with a profitable reduced cost. If such columns are found, the LP is re-optimized. Branching occurs when no profitable columns are found but the LP solution does not satisfy the integrality conditions. Branch-and-price applies column generation at every node of the branch-and-bound tree.<br />
<br />
<br />
The restricted master primal problem is <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j\in \mathcal L} \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i \in [N]\\ <br />
&\sum_{j\in \mathcal L}\lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
<br />
Its restricted dual problem is:<br />
<br />
$$ \begin{align*}\max\ \rho &\sum^{N}_{i=1}w_i + v - \sum^{N}_{i=1}u_i<br />
\\ s.t. &\sum_{i=1}^N \eta_{ij}w_i+ v \le 0 \ \ \ \forall j \in \mathcal L \\ <br />
&(1+\rho)w_i - u_i \le 1 \ \ \ \forall i \in [N] \\ &w \ge 0, u \ge 0, v\ \text{free}\end{align*}$$<br />
<br />
Furthermore, a pricing problem is used to determine, for a candidate optimal solution of the restricted dual, whether it is actually feasible for the full dual, or whether further columns (base learners) need to be added to the primal. Concretely, we check whether some left-out learner <math>j</math> satisfies:<br />
<br />
$$ \sum_{i=1}^N \eta_{ij}w_i^* + v^* > 0 $$<br />
<br />
The optimal misclassification values are determined by a branch-and-price process that branches on the variables <math> z_i </math> and solves the intermediate LPs using column generation.<br />
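A sketch of the pricing step (the dual values and columns below are toy numbers, not from the paper): given an optimal dual solution <math>(w^*, v^*)</math> of the restricted LP, any left-out learner whose column satisfies the inequality above prices out favourably and is added:

```python
def reduced_cost(eta_col, w, v):
    """sum_i eta_ij * w_i + v for one candidate column j."""
    return sum(e * wi for e, wi in zip(eta_col, w)) + v

def columns_to_add(eta_cols, w, v, tol=1e-9):
    """Indices of candidate learners that price out favourably."""
    return [j for j, col in enumerate(eta_cols)
            if reduced_cost(col, w, v) > tol]

w = [0.2, 0.2, 0.2]          # dual values w_i* (one per training point)
v = -0.5                     # dual value v* of the convexity constraint
eta_cols = [[1, 1, -1],      # column of eta_{ij} for each candidate j
            [-1, 1, 1],
            [1, 1, 1]]
print(columns_to_add(eta_cols, w, v))  # -> [2]: only learner 2 enters
```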
<br />
===Algorithm===<br />
<div style="margin-left: 3em;"><br />
'''Input:''' dataset <math> D = \{(x_i, y_i) | i ∈ I\} ⊆ R^d × \{±1\} </math>, class of base learners <math>Ω </math>, margin <math> \rho </math> <br><br />
'''Output:''' Boosted learner <math> \sum_{j∈L^∗}h_jλ_j^* </math> with base learners <math> h_j </math> and weights <math> λ_j^* </math> <br><br />
<br />
<ol><br />
<br />
<li> <math> T ← \{([0, 1]^N, \emptyset)\} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // set of local bounds and learners for open subproblems </li><br />
<li> <math> U ← \infty, L^∗ ← \emptyset </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Upper bound on optimal objective </li><br />
<li> '''while''' <math>\ T \neq \emptyset </math> '''do''' </li><br />
<li> &emsp; Choose and remove <math>(B,L) </math> from <math>T </math> </li><br />
<li> &emsp; '''repeat''' </li><br />
<li> &emsp; &emsp; Solve the primal IP using the local bounds on <math> z </math> in <math>B</math> with optimal dual solution <math> (w^∗, v^∗, u^∗) </math> </li><br />
<li> &emsp; &emsp; Find learner <math> h_j ∈ Ω </math> satisfying the pricing problem. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Solve pricing problem. </li><br />
<li> &emsp; '''until''' <math> h_j </math> is not found </li> <br />
<li> &emsp; Let <math> (\widetilde{λ} , \widetilde{z}) </math> be the final solution of the primal IP with base learners <math> \widetilde{L} = \{j | \widetilde{λ}_j > 0\} </math> </li><br />
<li> &emsp; '''if''' <math> \widetilde{z} ∈ \mathbb{Z}^N </math> and <math> \sum^{N}_{i=1}\widetilde{z}_i < U </math> '''then''' </li><br />
<li> &emsp; &emsp; <math> U ← \sum^{N}_{i=1}\widetilde{z}_i, L^∗ ← \widetilde{L}, λ^∗ ← \widetilde{\lambda} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Update best solution. </li><br />
<li> &emsp; '''else''' </li><br />
<li> &emsp; &emsp; Choose <math> i ∈ [N] </math> with <math> \widetilde{z}_i \notin \mathbb{Z} </math> </li><br />
<li> &emsp; &emsp; Set <math> B_0 ← B ∩ \{z_i ≤ 0\}, B_1 ← B ∩ \{z_i ≥ 1\} </math> </li><br />
<li> &emsp; &emsp; Add <math> (B_0,\widetilde{L}), (B_1,\widetilde{L}) </math> to <math> T </math>. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Create new branching nodes. </li><br />
<li> &emsp; '''end''' if </li><br />
<li> '''end''' while </li><br />
<li> ''Optionally sparsify final solution <math>L^*</math>'' </li><br />
<br />
</ol><br />
</div><br />
<br />
===Sparsification===<br />
<br />
When building a boosting model, one challenge is balancing model accuracy against generalization, and a sparse model is generally easier to interpret. To control complexity and overfitting, some form of sparsity is therefore typically imposed. The goal of sparsification is to minimize \begin{equation}\sum_{i=1}^{N}z_i+\sum_{j=1}^{L}\alpha_j\lambda_j,\end{equation} i.e., the misclassification count plus a complexity penalty on the learner weights. Two techniques are commonly used: early stopping, and regularization by adding a complexity term for the learners to the objective function. Previous approaches in the context of LP-based boosting have promoted sparsity by means of cutting planes. In the approach introduced in this paper, sparsification can be handled by solving a delayed integer program using additional cutting planes.<br />
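One simple post-hoc heuristic in this spirit (a made-up sketch, not the paper's cutting-plane approach) is to drop learners with near-zero weight and renormalise the rest:

```python
def sparsify(weights, tol=1e-3):
    """Drop near-zero learner weights lambda_j and renormalise the rest."""
    kept = {j: w for j, w in weights.items() if w > tol}
    s = sum(kept.values())
    return {j: w / s for j, w in kept.items()}

lam = {0: 0.5, 1: 0.4995, 2: 0.0005}   # toy learner weights
print(sparsify(lam))                   # learner 2 is dropped
```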
<br />
== Results and Performance ==<br />
<br />
''All tests were run on identical Linux clusters with Intel Xeon Quad Core CPUs at 3.50 GHz, with 10 MB cache and 32 GB of main memory.''<br />
<br />
<br />
The following results reflect IPBoost's performance on hard instances; by hard instances, we mean binary classification problems with predefined labels, tailored to the ±1 classification error function. On every hard instance sample, IPBoost significantly outperforms both LPBoost and AdaBoost (although results can differ slightly depending on the library implementations used). For the considered instances the best value of the margin ρ was 0.05 for both LPBoost and IPBoost; AdaBoost has no margin parameter. The accuracy reported is test accuracy recorded across several runs of the algorithm; <math>L </math> denotes the aggregate number of learners required to find the optimal learner, N is the number of points, and <math> \gamma </math> is the noise level.<br />
<br />
[[File:ipboostres.png|center]]<br />
<br />
<br />
<br />
For the next table, the classification instances are from the LIBSVM data sets available at [https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/]. We report accuracies on the test and training sets, respectively; in each case, we report the average accuracy over 10 runs with different random seeds, together with its standard deviation. IPBoost again outperforms LPBoost and AdaBoost significantly. Solving integer programs is no doubt more computationally expensive than traditional boosting methods such as AdaBoost: the average run time of IPBoost (for ρ = 0.05) was 1367.78 seconds, as opposed to LPBoost's 164.35 seconds and AdaBoost's 3.59 seconds. On the flip side, we gain much better stability in the results, as well as higher scores across the board on both training and test sets.<br />
<br />
[[file:svmlibres.png|center]]<br />
<br />
<br />
== Conclusion ==<br />
<br />
IP-boosting avoids the bad performance on well-known hard classes and improves upon LP-boosting and AdaBoost on the LIBSVM instances, where even an improvement of a few percent is valuable. The major drawback is that the running time of the current implementation is much longer. Nevertheless, the algorithm could be sped up in the future by solving the intermediate LPs only approximately and by deriving tailored heuristics that generate good primal solutions.<br />
<br />
The approach is very well suited to an offline setting in which training may take time and even a small improvement is beneficial, or where convex boosters behave egregiously. It can also serve as a tool to investigate the general performance of methods of this kind.<br />
<br />
The IPBoost algorithm adds considerable complexity to basic boosting models for a slight accuracy gain while greatly increasing the time spent. It is 381 times slower than an AdaBoost model on a small dataset, which makes the practical usefulness of this model doubtful; given a larger dataset with millions of records, it would take too long to train. The base classifier choice was XGBoost, which is rather complex for a base classifier; it would be worth trying weaker learners such as tree stumps and comparing the results with other models. In addition, this model might not be more accurate than the model-ensembling technique in which each model uses a different algorithm.<br />
<br />
I think IPBoost can be helpful in cases where a small improvement in accuracy is worth the additional computational effort. For example, there are applications in finance where predicting the movement of stock prices slightly more accurately can help traders make millions more; in such cases, investing in expensive computing systems to implement such boosting techniques can perhaps be justified.<br />
<br />
Although IPBoost can generate better results than base boosting models such as AdaBoost, its time complexity is much higher than that of other models. Hence, it will be a problem to apply IPBoost to real-world training data, since such data sets are large and hard to handle. We are thus left with the question: IPBoost is only slightly more accurate than the base boosting models but far more time-consuming, so is the model really worth it?<br />
<br />
== Critique ==<br />
<br />
The introduction section should change the sentence "3. Push weight of the data ... will have a higher weight on those data that misclassified" to "... will place a higher weight on those misclassified data."<br />
<br />
The column generation method is used when there are many variables compared to the number of constraints; if there are only a few variables relative to the constraints, the method will not be very efficient. It would also be better if the efficiency of the different boosting models were tested.<br />
<br />
In practice, AdaBoost and LPBoost are quite robust to overfitting: if you use simple weak learners such as stumps (1-level decision trees), the algorithms are much less prone to overfitting. However, AdaBoost in particular is prone to overfitting noisy datasets; to avoid this, use regularized models (AdaBoostReg, LPBoost). Furthermore, in higher-dimensional spaces AdaBoost can suffer, since it is just a linear combination of classifiers. One can use k-fold cross-validation to set the stopping parameter and mitigate this.<br />
<br />
== References ==<br />
<br />
* Pfetsch, M. E., & Pokutta, S. (2020). IPBoost--Non-Convex Boosting via Integer Programming. arXiv preprint arXiv:2002.04679.<br />
<br />
* Freund, Y., & Schapire, R. E. (1995, March). A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory (pp. 23-37). Springer, Berlin, Heidelberg.</div>
<hr />
<div>== Presented by == <br />
Casey De Vera, Solaiman Jawad<br />
<br />
== Introduction == <br />
Boosting is an important and a fairly standard technique in a classification that combines several base learners (learners which have a “low accuracy”), into one boosted learner (learner with a “high accuracy”). Pioneered by the AdaBoost approach of Freund & Schapire, in recent decades there has been extensive work on boosting procedures and analyses of their limitations.<br />
<br />
In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:<br />
<br />
Let <math>T</math> represents the number of iteration and <math>t</math> represents the iterator. <br />
<br />
for <math> t= 1, \cdots, T </math> do the following:<br />
<br />
::1. Train a learner <math> \mu_t</math> from a given class of base learners on the data distribution <math> \mathcal D_t</math><br />
<br />
::2. Evaluate the performance of <math> \mu_t</math> by computing its loss.<br />
<br />
::3. Push weight of the data distribution <math> \mathcal D_t</math> towards the misclassified examples leading to <math> \mathcal D_{t+1}</math>, usually a relatively good model will have a higher weight on those data that misclassified.<br />
<br />
Finally, the learners are combined with some form of voting (e.g., soft or hard voting, averaging, thresholding).<br />
<br />
<br />
[[File:boosting.gif|200px|thumb|right]] A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function using coordinate gradient descent. Boosting schemes of this type are often referred to as convex potential boosters. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. However, they can be trounced by a small number of incorrect labels, which can be quite challenging to fix. We try to solve misclassified examples by moving the weights around, which results in bad performance on unseen data (Shown by zoomed example to the right). In fact, in theory, providing the class of base learners is rich enough, a perfect strong learner can be constructed with accuracy 1; however, clearly, such a learner might not necessarily generalize well. Boosted learners can generate some quite complicated decision boundaries, much more complicated than the base learners. Here is an example from Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook. Here data is generated online according to some process with optimal decision boundary represented by the dotted line, and XGBoost was used to learn a classifier:<br />
<br />
<br />
Recently non-convex optimization approaches for solving machine learning problems have gained significant attention. In this paper, the non-convex boosting in classification using integer programming has been explored, and the real-world practicability of the approach while circumventing shortcomings of convex boosting approaches is also shown. The paper reports results that are comparable to or better than current state-of-the-art approaches.<br />
<br />
== Motivation ==<br />
<br />
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. Noisy labels are an issue because when the writer trained a model with excessive amounts of noisy labels, the performance and accuracy deteriorate significantly. However, if we revisit the general boosting template from above, then we might suspect that we run into trouble as soon as a certain fraction of training examples is misclassified: in this case, the model cannot correctly classify these examples, and the procedure shifts more and more weight towards these bad examples. It eventually leads to a strong learner that accurately predicts the (flawed) training data; however, that does not generalize well anymore. This intuition has been formalized by [LS], who construct a "hard" training data distribution, where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in these boosted learners' performance; see the tables below. The more technical reason for this problem is the convexity of the loss function minimized by the boosting procedure. One can use all types of "tricks," such as early stopping, but this is not solving the fundamental problem at the end of the day.<br />
<br />
== IPBoost: Boosting via Integer Programming ==<br />
<br />
<br />
===Integer Program Formulation===<br />
Let <math>(x_1,y_1),\cdots, (x_N,y_N) </math> be the training set with points <math>x_i \in \mathbb{R}^d</math> and two-class labels <math>y_i \in \{\pm 1\}</math> <br />
* class of base learners: <math> \Omega :=\{h_1, \cdots, h_L: \mathbb{R}^d \rightarrow \{\pm 1\}\} </math> and <math>\rho \ge 0</math> be given. <br />
* error function <math> \eta </math><br />
Our boosting model is captured by the integer programming problem. We can call this our primal problem: <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j=1}^L \eta_{ij}\lambda_k+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i=1,\cdots, N \\ <br />
&\sum_{j=1}^L \lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
For the error function <math>\eta</math>, three options were considered:<br />
<br />
(i) <math> \pm 1 </math> classification from learners<br />
$$ \eta_{ij} := 2\mathbb{I}[h_j(x_i) = y_i] - 1 = y_i \cdot h_j(x_i) $$<br />
(ii) class probabilities of learners<br />
$$ \eta_{ij} := 2\mathbb{P}[h_j(x_i) = y_i] - 1$$<br />
(iii) SAMME.R error function for learners<br />
$$ \frac{1}{2}y_i\log\left(\frac{\mathbb{P}[h_j(x_i) = 1]}{\mathbb{P}[h_j(x_i) = -1]}\right)$$<br />
<br />
===Solution of the IP using Column Generation===<br />
<br />
The goal of column generation is to provide an efficient way to solve the primal's linear programming relaxation by allowing the zi variables to assume fractional values. Moreover, columns, i.e., the base learners, <math> \mathcal L \subseteq [L]. </math> . are left out because there are too many to handle efficiently, and most of them will have their associated weight equal to zero in the optimal solution. A branch and bound framework is used To generate columns. Columns are created within a branch-and-bound framework leading effectively to a branch-and-bound-and-price algorithm being used; this is significantly more involved compared to column generation in linear programming. To check the optimality of an LP solution, a subproblem, called the pricing problem, is solved to identify columns with a profitable reduced cost. If such columns are found, the LP is re-optimized. Branching occurs when no profitable columns are found, but the LP solution does not satisfy the integrality conditions. Branch and price apply column generation at every node of the branch and bound tree.<br />
<br />
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. Noisy labels are an issue because when the writer trained a model with excessive amounts of noisy labels, the performance and accuracy deteriorate significantly. However, if we revisit the general boosting template from above, then we might suspect that we run into trouble as soon as a certain fraction of training examples is misclassified: in this case, the model cannot correctly classify these examples, and the procedure shifts more and more weight towards these bad examples. It eventually leads to a strong learner that accurately predicts the (flawed) training data; however, that does not generalize well anymore. This intuition has been formalized by [LS], who construct a "hard" training data distribution, where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in these boosted learners' performance; see the tables below. The more technical reason for this problem is the convexity of the loss function minimized by the boosting procedure. One can use all types of "tricks," such as early stopping, but this is not solving the fundamental problem at the end of the day.<br />
<br />
<br />
<br />
The restricted master primal problem is <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j\in \mathcal L} \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i \in [N]\\ <br />
&\sum_{j\in \mathcal L}\lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
<br />
Its restricted dual problem is:<br />
<br />
$$ \begin{align*}\max \rho &\sum^{N}_{i=1}w_i + v - \sum^{N}_{i=1}u_i<br />
\\ s.t. &\sum_{i=1}^N \eta_{ij}w_k+ v \le 0 \ \ \ \forall j \in L \\ <br />
&(1+\rho)w_i - u_i \le 1 \ \ \ \forall i \in [N] \\ &w \ge 0, u \ge 0, v\ free\end{align*}$$<br />
<br />
Furthermore, there is a pricing problem used to determine, for every supposed optimal solution of the dual, whether the solution is actually optimal, or whether further constraints need to be added into the primal solution. With this pricing problem, we check whether the solution to the restricted dual is feasible. This pricing problem can be expressed as follows:<br />
<br />
$$ \sum_{i=1}^N \eta_{ij}w_k^* + v^* > 0 $$<br />
<br />
The optimal misclassification values are determined by a branch-and-price process that branches on the variables <math> z_i </math> and solves the intermediate LPs using column generation.<br />
<br />
===Algorithm===<br />
<div style="margin-left: 3em;"><br />
<math> D = \{(x_i, y_i) | i ∈ I\} ⊆ R^d × \{±1\} </math>, class of base learners <math>Ω </math>, margin <math> \rho </math> <br><br />
'''Output:''' Boosted learner <math> \sum_{j∈L^∗}h_jλ_j^* </math> with base learners <math> h_j </math> and weights <math> λ_j^* </math> <br><br />
<br />
<ol><br />
<br />
<li margin-left:30px> <math> T ← \{([0, 1]^N, \emptyset)\} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // set of local bounds and learners for open subproblems </li><br />
<li> <math> U ← \infty, L^∗ ← \emptyset </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Upper bound on optimal objective </li><br />
<li> '''while''' <math>\ T \neq \emptyset </math> '''do''' </li><br />
<li> &emsp; Choose and remove <math>(B,L) </math> from <math>T </math> </li><br />
<li> &emsp; '''repeat''' </li><br />
<li> &emsp; &emsp; Solve the primal IP using the local bounds on <math> z </math> in <math>B</math> with optimal dual solution <math> (w^∗, v^∗, u^∗) </math> </li><br />
<li> &emsp; &emsp; Find learner <math> h_j ∈ Ω </math> satisfying the pricing problem. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Solve pricing problem. </li><br />
<li> &emsp; '''until''' <math> h_j </math> is not found </li> <br />
<li> &emsp; Let <math> (\widetilde{λ} , \widetilde{z}) </math> be the final solution of the primal IP with base learners <math> \widetilde{L} = \{j | \widetilde{λ}_j > 0\} </math> </li><br />
<li> &emsp; '''if''' <math> \widetilde{z} ∈ \mathbb{Z}^N </math> and <math> \sum^{N}_{i=1}\widetilde{z}_i < U </math> '''then''' </li><br />
<li> &emsp; &emsp; <math> U ← \sum^{N}_{i=1}\widetilde{z}_i, L^∗ ← \widetilde{L}, λ^∗ ← \widetilde{\lambda} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Update best solution. </li><br />
<li> &emsp; '''else''' </li><br />
<li> &emsp; &emsp; Choose <math> i ∈ [N] </math> with <math> \widetilde{z}_i \notin Z </math> </li><br />
<li> &emsp; &emsp; Set <math> B_0 ← B ∩ \{z_i ≤ 0\}, B_1 ← B ∩ \{z_i ≥ 1\} </math> </li><br />
<li> &emsp; &emsp; Add <math> (B_0,\widetilde{L}), (B_1,\widetilde{L}) </math> to <math> T </math>. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Create new branching nodes. </li><br />
<li> &emsp; '''end''' if </li><br />
<li> '''end''' while </li><br />
<li> ''Optionally sparsify final solution <math>L^*</math>'' </li><br />
<br />
</ol><br />
</div><br />
<br />
===Sparsification===<br />
<br />
When building the boosting model, one of the challenges is to balance model accuracy and generalization. A sparse model is generally easier to interpret. In order to control complexity, overfitting, and generalization of the model, typically some sparsity is applied, which is easier to interpret for some applications. The goal of sparsification is to minimize the following: \begin{equation}\sum_{i=1}^{N}z_i+\sum_{i=1}^{L}\alpha_iy_i\end{equation} There are two techniques commonly used. One is early stopping, another is regularization by adding a complexity term for the learners in the objective function. Previous approaches in the context of LP-based boosting have promoted sparsity by means of cutting planes. Sparsification can be handled in the approach introduced in this paper by solving a delayed integer program using additional cutting planes.<br />
<br />
== Results and Performance ==<br />
<br />
''All tests were run on identical Linux clusters with Intel Xeon Quad Core CPUs, with 3.50GHz, 10 MB cache, and 32 GB of main memory.''<br />
<br />
<br />
The following results reflect IPBoost's performance on hard instances. By hard instances, we mean binary classification problems with predefined labels, tailored to the <math>\pm 1</math> classification error function for learners. On every hard instance sample, IPBoost significantly outperforms both LPBoost and AdaBoost (although results can differ slightly depending on the libraries used in the implementation). For the considered instances, the best value for the margin <math>\rho</math> was 0.05 for LPBoost and IPBoost; AdaBoost has no margin parameter. The accuracy reported is test accuracy averaged over several runs of the algorithm, <math>L</math> denotes the aggregate number of learners required to find the optimal learner, <math>N</math> is the number of points, and <math>\gamma</math> is the noise level.<br />
<br />
[[File:ipboostres.png|center]]<br />
<br />
<br />
<br />
For the next table, the classification instances were taken from the LIBSVM data sets available at [https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/]. We report accuracies on the test and training sets, respectively; in each case we report the average accuracy over 10 runs with different random seeds, together with its standard deviation. IPBoost again outperforms LPBoost and AdaBoost significantly. Solving integer programming problems is no doubt more computationally expensive than traditional boosting methods like AdaBoost: the average run time of IPBoost (for <math>\rho</math> = 0.05) was 1367.78 seconds, as opposed to 164.35 seconds for LPBoost and 3.59 seconds for AdaBoost. On the flip side, we gain much better stability in the results, as well as higher scores across the board on both training and test sets.<br />
<br />
[[file:svmlibres.png|center]]<br />
<br />
<br />
== Conclusion ==<br />
<br />
IPBoost avoids the bad performance on well-known hard classes and improves upon LPBoost and AdaBoost on the LIBSVM instances, where even an improvement of a few percent is valuable. The major drawback is that the running time of the current implementation is much longer. Nevertheless, the algorithm could be sped up in the future by solving the intermediate LPs only approximately and by deriving tailored heuristics that generate good primal solutions.<br />
<br />
The approach is very well suited to an offline setting in which training may take time and where even a small improvement is beneficial, or where convex boosters exhibit egregious behaviour. It can also serve as a tool to investigate the general performance of methods of this kind.<br />
<br />
The IPBoost algorithm adds extra complexity to basic boosting models for a slight accuracy gain while greatly increasing the time spent: it is 381 times slower than an AdaBoost model on a small dataset, which makes the practical usage of this model doubtful. Given a larger dataset with millions of records, this model would take too long to run. The base classifier choice was XGBoost, which is quite complicated for a base classifier; it would be worth trying weaker learners such as tree stumps and comparing the results with other models. In addition, this model might not be as accurate as a model-ensembling technique in which each constituent model utilizes a different algorithm.<br />
<br />
I think IPBoost can be helpful in cases where a small improvement in accuracy is worth the additional computational effort involved. For example, there are applications in finance where predicting the movement of stock prices slightly more accurately can help traders make millions more; in such cases, investing in expensive computing systems to implement such boosting techniques can perhaps be justified.<br />
<br />
Although IPBoost can generate better results than base boosting models such as AdaBoost, its time complexity is much higher. Hence, it will be a problem to apply IPBoost to real-world training data, since such data sets are large and hard to deal with. We are thus left with the question: IPBoost is only slightly more accurate than the base boosting models but far more expensive in time, so is the model really worth it?<br />
<br />
== Critique ==<br />
<br />
In the introduction section, the sentence "3. Push weight of the data ... will have a higher weight on those data that misclassified" should be changed to "... will have a higher weight on those data that are misclassified".<br />
<br />
The column generation method is used when there are a huge number of variables compared to the number of constraints. If there are only a few variables compared to the number of constraints, using this method will not be very efficient.<br />
<br />
It would be better if the efficiency of different boosting models were tested as well.<br />
<br />
In practice, AdaBoost and LPBoost are quite robust to overfitting, especially if you use simple weak learners such as stumps (1-level decision trees). However, AdaBoost in particular is prone to overfitting on noisy datasets; to avoid this, use regularized models (AdaBoostReg, LPBoost). Further, AdaBoost can suffer in higher-dimensional spaces, since it is just a linear combination of classifiers. k-fold cross-validation can be used to set the stopping parameter and mitigate this.<br />
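To make the last suggestion concrete, here is a small, hedged sketch (a generic scikit-learn recipe, not taken from the paper) that boosts depth-1 stumps — scikit-learn's default weak learner for <code>AdaBoostClassifier</code> — on data with flipped labels and picks the stopping parameter by 5-fold cross-validation; the data set and candidate grid are assumptions for the example.<br />

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Two-class data with 10% of labels flipped, mimicking label noise.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.10,
                           random_state=0)

# The default weak learner is a depth-1 decision stump; choose the
# stopping parameter (number of boosting rounds) by 5-fold CV.
cv_scores = {n: cross_val_score(AdaBoostClassifier(n_estimators=n,
                                                   random_state=0),
                                X, y, cv=5).mean()
             for n in (10, 50, 100, 200)}
best_n = max(cv_scores, key=cv_scores.get)
```

Stopping at <code>best_n</code> rounds rather than boosting indefinitely is exactly the kind of early stopping that keeps AdaBoost from chasing the noisy labels.<br />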
<br />
== References ==<br />
<br />
* Pfetsch, M. E., & Pokutta, S. (2020). IPBoost--Non-Convex Boosting via Integer Programming. arXiv preprint arXiv:2002.04679.<br />
<br />
* Freund, Y., & Schapire, R. E. (1995, March). A decision-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory (pp. 23-37). Springer, Berlin, Heidelberg.</div>
<hr />
<div>== Presented by == <br />
Casey De Vera, Solaiman Jawad<br />
<br />
== Introduction == <br />
Boosting is an important and a fairly standard technique in a classification that combines several base learners (learners which have a “low accuracy”), into one boosted learner (learner with a “high accuracy”). Pioneered by the AdaBoost approach of Freund & Schapire, in recent decades there has been extensive work on boosting procedures and analyses of their limitations.<br />
<br />
In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:<br />
<br />
Let <math>T</math> represents the number of iteration and <math>t</math> represents the iterator. <br />
<br />
for <math> t= 1, \cdots, T </math> do the following:<br />
<br />
::1. Train a learner <math> \mu_t</math> from a given class of base learners on the data distribution <math> \mathcal D_t</math><br />
<br />
::2. Evaluate the performance of <math> \mu_t</math> by computing its loss.<br />
<br />
::3. Push weight of the data distribution <math> \mathcal D_t</math> towards the misclassified examples leading to <math> \mathcal D_{t+1}</math>, usually a relatively good model will have a higher weight on those data that misclassified.<br />
<br />
Finally, the learners are combined with some form of voting (e.g., soft or hard voting, averaging, thresholding).<br />
<br />
<br />
[[File:boosting.gif|200px|thumb|right]] A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function using coordinate gradient descent. Boosting schemes of this type are often referred to as convex potential boosters. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. However, they can be trounced by a small number of incorrect labels, which can be quite challenging to fix. We try to solve misclassified examples by moving the weights around, which results in bad performance on unseen data (Shown by zoomed example to the right). In fact, in theory, providing the class of base learners is rich enough, a perfect strong learner can be constructed with accuracy 1; however, clearly, such a learner might not necessarily generalize well. Boosted learners can generate some quite complicated decision boundaries, much more complicated than the base learners. Here is an example from Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook. Here data is generated online according to some process with optimal decision boundary represented by the dotted line, and XGBoost was used to learn a classifier:<br />
<br />
<br />
Recently non-convex optimization approaches for solving machine learning problems have gained significant attention. In this paper, the non-convex boosting in classification using integer programming has been explored, and the real-world practicability of the approach while circumventing shortcomings of convex boosting approaches is also shown. The paper reports results that are comparable to or better than current state-of-the-art approaches.<br />
<br />
== Motivation ==<br />
<br />
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. Noisy labels are an issue because when the writer trained a model with excessive amounts of noisy labels, the performance and accuracy deteriorate significantly. However, if we revisit the general boosting template from above, then we might suspect that we run into trouble as soon as a certain fraction of training examples is misclassified: in this case, the model cannot correctly classify these examples, and the procedure shifts more and more weight towards these bad examples. It eventually leads to a strong learner that accurately predicts the (flawed) training data; however, that does not generalize well anymore. This intuition has been formalized by [LS], who construct a "hard" training data distribution, where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in these boosted learners' performance; see the tables below. The more technical reason for this problem is the convexity of the loss function minimized by the boosting procedure. One can use all types of "tricks," such as early stopping, but this is not solving the fundamental problem at the end of the day.<br />
<br />
== IPBoost: Boosting via Integer Programming ==<br />
<br />
<br />
===Integer Program Formulation===<br />
Let <math>(x_1,y_1),\cdots, (x_N,y_N) </math> be the training set with points <math>x_i \in \mathbb{R}^d</math> and two-class labels <math>y_i \in \{\pm 1\}</math> <br />
* class of base learners: <math> \Omega :=\{h_1, \cdots, h_L: \mathbb{R}^d \rightarrow \{\pm 1\}\} </math> and <math>\rho \ge 0</math> be given. <br />
* error function <math> \eta </math><br />
Our boosting model is captured by the integer programming problem. We can call this our primal problem: <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j=1}^L \eta_{ij}\lambda_k+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i=1,\cdots, N \\ <br />
&\sum_{j=1}^L \lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
For the error function <math>\eta</math>, three options were considered:<br />
<br />
(i) <math> \pm 1 </math> classification from learners<br />
$$ \eta_{ij} := 2\mathbb{I}[h_j(x_i) = y_i] - 1 = y_i \cdot h_j(x_i) $$<br />
(ii) class probabilities of learners<br />
$$ \eta_{ij} := 2\mathbb{P}[h_j(x_i) = y_i] - 1$$<br />
(iii) SAMME.R error function for learners<br />
$$ \frac{1}{2}y_i\log\left(\frac{\mathbb{P}[h_j(x_i) = 1]}{\mathbb{P}[h_j(x_i) = -1]}\right)$$<br />
<br />
===Solution of the IP using Column Generation===<br />
<br />
The goal of column generation is to provide an efficient way to solve the linear programming relaxation of the primal by allowing the <math>z_i </math> variables to assume fractional values. Moreover, columns, i.e., the base learners, <math> \mathcal L \subseteq [L]. </math> are left out because there are too many to handle efficiently and most of them will have their associated weight equal to zero in the optimal solution anyway. To generate columns, a <i>branch and bound</i> framework is used. Columns are generated within a<br />
branch-and-bound framework leading effectively to a branch-and-bound-and-price algorithm being used; this is significantly more involved compared to column generation in linear programming. To check the optimality of an LP solution, a subproblem, called the pricing problem, is solved to try to identify columns with a profitable reduced cost. If such columns are found, the LP is re-optimized. Branching occurs when no profitable columns are found, but the LP solution does not satisfy the integrality conditions. Branch and price apply column generation at every node of the branch and bound tree.<br />
<br />
The restricted master primal problem is <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j\in \mathcal L} \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i \in [N]\\ <br />
&\sum_{j\in \mathcal L}\lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
<br />
Its restricted dual problem is:<br />
<br />
$$ \begin{align*}\max \rho &\sum^{N}_{i=1}w_i + v - \sum^{N}_{i=1}u_i<br />
\\ s.t. &\sum_{i=1}^N \eta_{ij}w_k+ v \le 0 \ \ \ \forall j \in L \\ <br />
&(1+\rho)w_i - u_i \le 1 \ \ \ \forall i \in [N] \\ &w \ge 0, u \ge 0, v\ free\end{align*}$$<br />
<br />
Furthermore, there is a pricing problem used to determine, for every supposed optimal solution of the dual, whether the solution is actually optimal, or whether further constraints need to be added into the primal solution. With this pricing problem, we check whether the solution to the restricted dual is feasible. This pricing problem can be expressed as follows:<br />
<br />
$$ \sum_{i=1}^N \eta_{ij}w_k^* + v^* > 0 $$<br />
<br />
The optimal misclassification values are determined by a branch-and-price process that branches on the variables <math> z_i </math> and solves the intermediate LPs using column generation.<br />
<br />
===Algorithm===<br />
<div style="margin-left: 3em;"><br />
<math> D = \{(x_i, y_i) | i ∈ I\} ⊆ R^d × \{±1\} </math>, class of base learners <math>Ω </math>, margin <math> \rho </math> <br><br />
'''Output:''' Boosted learner <math> \sum_{j∈L^∗}h_jλ_j^* </math> with base learners <math> h_j </math> and weights <math> λ_j^* </math> <br><br />
<br />
<ol><br />
<br />
<li margin-left:30px> <math> T ← \{([0, 1]^N, \emptyset)\} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // set of local bounds and learners for open subproblems </li><br />
<li> <math> U ← \infty, L^∗ ← \emptyset </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Upper bound on optimal objective </li><br />
<li> '''while''' <math>\ T \neq \emptyset </math> '''do''' </li><br />
<li> &emsp; Choose and remove <math>(B,L) </math> from <math>T </math> </li><br />
<li> &emsp; '''repeat''' </li><br />
<li> &emsp; &emsp; Solve the primal IP using the local bounds on <math> z </math> in <math>B</math> with optimal dual solution <math> (w^∗, v^∗, u^∗) </math> </li><br />
<li> &emsp; &emsp; Find learner <math> h_j ∈ Ω </math> satisfying the pricing problem. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Solve pricing problem. </li><br />
<li> &emsp; '''until''' <math> h_j </math> is not found </li> <br />
<li> &emsp; Let <math> (\widetilde{λ} , \widetilde{z}) </math> be the final solution of the primal IP with base learners <math> \widetilde{L} = \{j | \widetilde{λ}_j > 0\} </math> </li><br />
<li> &emsp; '''if''' <math> \widetilde{z} ∈ \mathbb{Z}^N </math> and <math> \sum^{N}_{i=1}\widetilde{z}_i < U </math> '''then''' </li><br />
<li> &emsp; &emsp; <math> U ← \sum^{N}_{i=1}\widetilde{z}_i, L^∗ ← \widetilde{L}, λ^∗ ← \widetilde{\lambda} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Update best solution. </li><br />
<li> &emsp; '''else''' </li><br />
<li> &emsp; &emsp; Choose <math> i ∈ [N] </math> with <math> \widetilde{z}_i \notin Z </math> </li><br />
<li> &emsp; &emsp; Set <math> B_0 ← B ∩ \{z_i ≤ 0\}, B_1 ← B ∩ \{z_i ≥ 1\} </math> </li><br />
<li> &emsp; &emsp; Add <math> (B_0,\widetilde{L}), (B_1,\widetilde{L}) </math> to <math> T </math>. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Create new branching nodes. </li><br />
<li> &emsp; '''end''' if </li><br />
<li> '''end''' while </li><br />
<li> ''Optionally sparsify final solution <math>L^*</math>'' </li><br />
<br />
</ol><br />
</div><br />
<br />
===Sparsification===<br />
<br />
When building the boosting model, one of the challenges is to balance model accuracy and generalization. A sparse model is generally easier to interpret. In order to control complexity, overfitting, and generalization of the model, typically some sparsity is applied, which is easier to interpret for some applications. The goal of sparsification is to minimize the following: \begin{equation}\sum_{i=1}^{N}z_i+\sum_{i=1}^{L}\alpha_iy_i\end{equation} There are two techniques commonly used. One is early stopping, another is regularization by adding a complexity term for the learners in the objective function. Previous approaches in the context of LP-based boosting have promoted sparsity by means of cutting planes. Sparsification can be handled in the approach introduced in this paper by solving a delayed integer program using additional cutting planes.<br />
<br />
== Results and Performance ==<br />
<br />
''All tests were run on identical Linux clusters with Intel Xeon Quad Core CPUs, with 3.50GHz, 10 MB cache, and 32 GB of main memory.''<br />
<br />
<br />
The following results reflect IPBoost's performance in hard instances. Note that by hard instances, we mean a binary classification problem with predefined labels. These examples are tailored to using the ±1 classification from learners. On every hard instance sample, IPBoost significantly outperforms both LPBoost and AdaBoost (although implementations depending on the libraries used have often caused results to differ slightly). For the considered instances the best value for the margin ρ was 0.05 for LPBoost and IPBoost; AdaBoost has no margin parameter. The accuracy reported is test accuracy recorded across various different walkthroughs of the algorithm, while <math>L </math> denotes the aggregate number of learners required to find the optimal learner, N is the number of points and <math> \gamma </math> refers to the noise level.<br />
<br />
[[File:ipboostres.png|center]]<br />
<br />
<br />
<br />
For the next table, the classification instances from LIBSVM data sets available at [https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/]. We report accuracies on the test set and train set, respectively. In each case, we report the averages of the accuracies over 10 runs with a different random seed and their standard deviations. We can see IPboost again outperforming LPBoost and AdaBoost significantly. Solving Integer Programming problems is no doubt more computationally expensive than traditional boosting methods like AdaBoost. The average run time of IPBoost (for ρ = 0.05) being 1367.78 seconds, as opposed to LPBoost's 164.35 seconds and AdaBoost's 3.59 seconds reflects exactly that. However, on the flip side, we gain much better stability in our results, as well as higher scores across the board for both training and test sets.<br />
<br />
[[file:svmlibres.png|center]]<br />
<br />
<br />
== Conclusion ==<br />
<br />
IP-boosting avoids the bad performance on well-known hard classes and improves upon LP-boosting and AdaBoost on the LIBSVM instances where even a few percent improvements is valuable. The major drawback is that the running time with the current implementation is much longer. Nevertheless, the algorithm can be improved in the future by solving the intermediate LPs only approximately and deriving tailored heuristics that generate decent primal solutions to save on time.<br />
<br />
The approach is suited very well to an offline setting in which training may take time and where even a small improvement is beneficial or when convex boosters have egregious behaviour. It can also be served as a tool to investigate the general performance of methods like this.<br />
<br />
The IPBoost algorithm added extra complexity into basic boosting models to gain slight accuracy gain while greatly increased the time spent. It is 381 times slower compared to an AdaBoost model on a small dataset which makes the actual usage of this model doubtable. If we are supplied with a larger dataset with millions of records this model would take too long to complete. The base classifier choice was XGBoost which is too complicated for a base classifier, maybe try some weaker learners such as tree stumps to compare the result with other models. In addition, this model might not be accurate compared to the model-ensembling technique where each model utilizes a different algorithm.<br />
<br />
I think IP Boost can be helpful in cases where a small improvement in accuracy is worth the additional computational effort involved. For example, there are applications in Finance where predicting the movement of stock prices slightly more accurately can help traders make millions more. In such cases, investing in expensive computing systems to implement such boosting techniques can perhaps be justified.<br />
<br />
Although the IPBoost can generate better results compared to the base boosting models such as AdaBoost. The time complexity of IPBoost is much higher than other models. Hence, it will be problem to apply IPBoost to the real-world training data since those training data will be large and it's hard to deal with. Thus, we are left with the question that this IPBoost is only a little bit more accurate than the base boosting models but a lot more time complex. So is the model really worth it?<br />
<br />
== Critique ==<br />
<br />
In the introduction section, should change the sentence"3. Push weight of the data ... will have a higher weight on those data that misclassified." to "... will have a higher weight on those data that are misclassified.<br />
<br />
The column generation method is used when there are a huge number of variables compared to the number of constraints. If there are only a few variables compared to the number of constraints, using this method will not be very efficient.<br />
<br />
It would be better if the efficiency of different boosting models are tested.<br />
<br />
In practice AdaBoost & LP Boost are quite robust to over fitting. In practice if you use simple weak learns stumps (1-level decision trees) then the algorithms are much less prone to over fitting. However the noise level in the data, particularly for AdaBoost is prone to overfitting on noisy datasets. To avoid this use regularized models (AdaBoostReg, LPBoost). Further, when in higher dimensional spaces, AdaBoost can suffer, since its just a linear combination of classifiers. You can use k-fold cross validation to set the stopping parameter to improve this.<br />
<br />
== References ==<br />
<br />
* Pfetsch, M. E., & Pokutta, S. (2020). IPBoost--Non-Convex Boosting via Integer Programming. arXiv preprint arXiv:2002.04679.<br />
<br />
* Freund, Y., & Schapire, R. E. (1995, March). A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory (pp. 23-37). Springer, Berlin, Heidelberg. pdf</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=IPBoost&diff=49109IPBoost2020-12-04T06:42:04Z<p>Z42qin: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Casey De Vera, Solaiman Jawad<br />
<br />
== Introduction == <br />
Boosting is an important and a fairly standard technique in a classification that combines several base learners (learners which have a “low accuracy”), into one boosted learner (learner with a “high accuracy”). Pioneered by the AdaBoost approach of Freund & Schapire, in recent decades there has been extensive work on boosting procedures and analyses of their limitations.<br />
<br />
In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:<br />
<br />
Let <math>T</math> represents the number of iteration and <math>t</math> represents the iterator. <br />
<br />
for <math> t= 1, \cdots, T </math> do the following:<br />
<br />
::1. Train a learner <math> \mu_t</math> from a given class of base learners on the data distribution <math> \mathcal D_t</math><br />
<br />
::2. Evaluate the performance of <math> \mu_t</math> by computing its loss.<br />
<br />
::3. Push weight of the data distribution <math> \mathcal D_t</math> towards the misclassified examples leading to <math> \mathcal D_{t+1}</math>, usually a relatively good model will have a higher weight on those data that misclassified.<br />
<br />
Finally, the learners are combined with some form of voting (e.g., soft or hard voting, averaging, thresholding).<br />
<br />
<br />
[[File:boosting.gif|200px|thumb|right]] A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function using coordinate gradient descent. Boosting schemes of this type are often referred to as convex potential boosters. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. However, they can be trounced by a small number of incorrect labels, which can be quite challenging to fix. We try to solve misclassified examples by moving the weights around, which results in bad performance on unseen data (Shown by zoomed example to the right). In fact, in theory, providing the class of base learners is rich enough, a perfect strong learner can be constructed with accuracy 1; however, clearly, such a learner might not necessarily generalize well. Boosted learners can generate some quite complicated decision boundaries, much more complicated than the base learners. Here is an example from Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook. Here data is generated online according to some process with optimal decision boundary represented by the dotted line, and XGBoost was used to learn a classifier:<br />
<br />
<br />
Recently non-convex optimization approaches for solving machine learning problems have gained significant attention. In this paper, the non-convex boosting in classification using integer programming has been explored, and the real-world practicability of the approach while circumventing shortcomings of convex boosting approaches is also shown. The paper reports results that are comparable to or better than current state-of-the-art approaches.<br />
<br />
== Motivation ==<br />
<br />
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. Noisy labels are an issue because when a model is trained with excessive amounts of noisy labels, the performance and accuracy deteriorate greatly. However, if we revisit the general boosting template from above, then we might suspect that we run into trouble as soon as a certain fraction of training examples is misclassified: in this case, these examples cannot be correctly classified and the procedure shifts more and more weight towards these bad examples. This eventually leads to a strong learner, that perfectly predicts the (flawed) training data; however, that does not generalize well anymore. This intuition has been formalized by [LS] who construct a “hard” training data distribution, where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in performance of these boosted learners; see tables below. The more technical reason for this problem is actually the convexity of the loss function that is minimized by the boosting procedure. Clearly, one can use all types of “tricks” such as early stopping but at the end of the day, this is not solving the fundamental problem.<br />
<br />
== IPBoost: Boosting via Integer Programming ==<br />
<br />
<br />
===Integer Program Formulation===<br />
Let <math>(x_1,y_1),\cdots, (x_N,y_N) </math> be the training set with points <math>x_i \in \mathbb{R}^d</math> and two-class labels <math>y_i \in \{\pm 1\}</math> <br />
* class of base learners: <math> \Omega :=\{h_1, \cdots, h_L: \mathbb{R}^d \rightarrow \{\pm 1\}\} </math> and <math>\rho \ge 0</math> be given. <br />
* error function <math> \eta </math><br />
Our boosting model is captured by the integer programming problem. We can call this our primal problem: <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j=1}^L \eta_{ij}\lambda_k+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i=1,\cdots, N \\ <br />
&\sum_{j=1}^L \lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
For the error function <math>\eta</math>, three options were considered:<br />
<br />
(i) <math> \pm 1 </math> classification from learners<br />
$$ \eta_{ij} := 2\mathbb{I}[h_j(x_i) = y_i] - 1 = y_i \cdot h_j(x_i) $$<br />
(ii) class probabilities of learners<br />
$$ \eta_{ij} := 2\mathbb{P}[h_j(x_i) = y_i] - 1$$<br />
(iii) SAMME.R error function for learners<br />
$$ \frac{1}{2}y_i\log\left(\frac{\mathbb{P}[h_j(x_i) = 1]}{\mathbb{P}[h_j(x_i) = -1]}\right)$$<br />
<br />
===Solution of the IP using Column Generation===<br />
<br />
The goal of column generation is to provide an efficient way to solve the linear programming relaxation of the primal by allowing the <math>z_i </math> variables to assume fractional values. Moreover, columns, i.e., the base learners, <math> \mathcal L \subseteq [L]. </math> are left out because there are too many to handle efficiently and most of them will have their associated weight equal to zero in the optimal solution anyway. To generate columns, a <i>branch and bound</i> framework is used. Columns are generated within a<br />
branch-and-bound framework leading effectively to a branch-and-bound-and-price algorithm being used; this is significantly more involved compared to column generation in linear programming. To check the optimality of an LP solution, a subproblem, called the pricing problem, is solved to try to identify columns with a profitable reduced cost. If such columns are found, the LP is re-optimized. Branching occurs when no profitable columns are found, but the LP solution does not satisfy the integrality conditions. Branch and price apply column generation at every node of the branch and bound tree.<br />
<br />
The restricted master primal problem is <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j\in \mathcal L} \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i \in [N]\\ <br />
&\sum_{j\in \mathcal L}\lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
<br />
Its restricted dual problem is:<br />
<br />
$$ \begin{align*}\max \rho &\sum^{N}_{i=1}w_i + v - \sum^{N}_{i=1}u_i<br />
\\ s.t. &\sum_{i=1}^N \eta_{ij}w_k+ v \le 0 \ \ \ \forall j \in L \\ <br />
&(1+\rho)w_i - u_i \le 1 \ \ \ \forall i \in [N] \\ &w \ge 0, u \ge 0, v\ free\end{align*}$$<br />
<br />
Furthermore, a pricing problem is used to determine, for every candidate optimal solution of the dual, whether that solution is actually optimal or whether further columns need to be added to the restricted primal problem. With this pricing problem, we check whether the solution to the restricted dual is feasible for the full dual; a profitable column exists whenever some learner <math>j</math> satisfies:<br />
<br />
$$ \sum_{i=1}^N \eta_{ij}w_i^* + v^* > 0 $$<br />
<br />
The optimal misclassification values are determined by a branch-and-price process that branches on the variables <math> z_i </math> and solves the intermediate LPs using column generation.<br />
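As a concrete illustration, the LP relaxation of the restricted master problem can be solved directly for a toy instance with `scipy.optimize.linprog`. The <math>\eta</math> matrix, margin, and instance size below are made up purely for illustration; this is a sketch of one node's LP, not the full branch-and-price procedure:

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance of the LP relaxation of the restricted master problem:
#   min sum(z)  s.t.  eta @ lam + (1 + rho) * z >= rho,  sum(lam) = 1,
#                     lam >= 0,  0 <= z <= 1.
# eta[i, j] = +1 if base learner j classifies point i correctly, -1 otherwise
# (values below are made up).
rho = 0.05
eta = np.array([[+1, +1],
                [+1, +1],
                [+1, -1],
                [-1, +1]])
N, L = eta.shape

c = np.concatenate([np.zeros(L), np.ones(N)])          # objective: sum of z_i
A_ub = np.hstack([-eta, -(1 + rho) * np.eye(N)])       # -eta@lam - (1+rho)z <= -rho
b_ub = -rho * np.ones(N)
A_eq = np.hstack([np.ones((1, L)), np.zeros((1, N))])  # convexity: sum(lam) = 1
b_eq = [1.0]
bounds = [(0, None)] * L + [(0, 1)] * N                # z_i relaxed from {0,1} to [0,1]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
print(res.fun)  # optimal (fractional) total misclassification penalty
```

Branch-and-price would then branch on the fractional <math>z_i</math> values of such a relaxation and re-solve with column generation at each node.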
<br />
===Algorithm===<br />
<div style="margin-left: 3em;"><br />
'''Input:''' <math> D = \{(x_i, y_i) | i ∈ I\} ⊆ R^d × \{±1\} </math>, class of base learners <math>Ω </math>, margin <math> \rho </math> <br><br />
'''Output:''' Boosted learner <math> \sum_{j∈L^∗}h_jλ_j^* </math> with base learners <math> h_j </math> and weights <math> λ_j^* </math> <br><br />
<br />
<ol><br />
<br />
<li> <math> T ← \{([0, 1]^N, \emptyset)\} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // set of local bounds and learners for open subproblems </li><br />
<li> <math> U ← \infty, L^∗ ← \emptyset </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Upper bound on optimal objective </li><br />
<li> '''while''' <math>\ T \neq \emptyset </math> '''do''' </li><br />
<li> &emsp; Choose and remove <math>(B,L) </math> from <math>T </math> </li><br />
<li> &emsp; '''repeat''' </li><br />
<li> &emsp; &emsp; Solve the restricted primal LP using the local bounds on <math> z </math> in <math>B</math>, obtaining optimal dual solution <math> (w^∗, v^∗, u^∗) </math> </li><br />
<li> &emsp; &emsp; Find learner <math> h_j ∈ Ω </math> satisfying the pricing problem. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Solve pricing problem. </li><br />
<li> &emsp; '''until''' <math> h_j </math> is not found </li> <br />
<li> &emsp; Let <math> (\widetilde{λ} , \widetilde{z}) </math> be the final solution of the primal LP with base learners <math> \widetilde{L} = \{j | \widetilde{λ}_j > 0\} </math> </li><br />
<li> &emsp; '''if''' <math> \widetilde{z} ∈ \mathbb{Z}^N </math> and <math> \sum^{N}_{i=1}\widetilde{z}_i < U </math> '''then''' </li><br />
<li> &emsp; &emsp; <math> U ← \sum^{N}_{i=1}\widetilde{z}_i, L^∗ ← \widetilde{L}, λ^∗ ← \widetilde{\lambda} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Update best solution. </li><br />
<li> &emsp; '''else''' </li><br />
<li> &emsp; &emsp; Choose <math> i ∈ [N] </math> with <math> \widetilde{z}_i \notin \mathbb{Z} </math> </li><br />
<li> &emsp; &emsp; Set <math> B_0 ← B ∩ \{z_i ≤ 0\}, B_1 ← B ∩ \{z_i ≥ 1\} </math> </li><br />
<li> &emsp; &emsp; Add <math> (B_0,\widetilde{L}), (B_1,\widetilde{L}) </math> to <math> T </math>. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Create new branching nodes. </li><br />
<li> &emsp; '''end''' if </li><br />
<li> '''end''' while </li><br />
<li> ''Optionally sparsify final solution <math>L^*</math>'' </li><br />
<br />
</ol><br />
</div><br />
<br />
===Sparsification===<br />
<br />
When building a boosting model, one of the challenges is to balance accuracy against generalization. A sparse model, i.e., one that uses few base learners, is generally easier to interpret, and some sparsity is typically enforced to control the complexity and overfitting of the model. With a complexity penalty <math>\alpha</math> on the learner weights, the sparsified objective becomes \begin{equation}\sum_{i=1}^{N}z_i+\alpha\sum_{j\in\mathcal L}\lambda_j.\end{equation} Two techniques are commonly used: early stopping, and regularization by adding such a complexity term for the learners to the objective function. Previous approaches in the context of LP-based boosting have promoted sparsity by means of cutting planes; the approach introduced in this paper can handle sparsification by solving a delayed integer program using additional cutting planes.<br />
<br />
== Results and Performance ==<br />
<br />
''All tests were run on identical Linux clusters with Intel Xeon Quad Core CPUs, with 3.50GHz, 10 MB cache, and 32 GB of main memory.''<br />
<br />
<br />
The following results reflect IPBoost's performance on hard instances; by hard instances, we mean binary classification problems with predefined labels, tailored to the ±1 classification output of the learners. On every hard instance sample, IPBoost significantly outperforms both LPBoost and AdaBoost (although implementations depending on the libraries used have often caused results to differ slightly). For the considered instances the best value for the margin ρ was 0.05 for LPBoost and IPBoost; AdaBoost has no margin parameter. The accuracy reported is test accuracy averaged across runs of the algorithm, while <math>L </math> denotes the aggregate number of learners required to find the optimal learner, <math>N</math> is the number of points, and <math> \gamma </math> refers to the noise level.<br />
<br />
[[File:ipboostres.png|center]]<br />
<br />
<br />
<br />
For the next table, the classification instances are taken from the LIBSVM data sets available at [https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/]. We report accuracies on the test set and training set, respectively; in each case, we report averages over 10 runs with different random seeds, along with their standard deviations. IPBoost again outperforms LPBoost and AdaBoost significantly. Solving integer programming problems is no doubt more computationally expensive than traditional boosting methods like AdaBoost: the average run time of IPBoost (for ρ = 0.05) was 1367.78 seconds, as opposed to LPBoost's 164.35 seconds and AdaBoost's 3.59 seconds, which reflects exactly that. On the flip side, however, we gain much better stability in the results, as well as higher scores across the board for both training and test sets.<br />
<br />
[[file:svmlibres.png|center]]<br />
<br />
<br />
== Conclusion ==<br />
<br />
IP-boosting avoids the bad performance on well-known hard classes and improves upon LP-boosting and AdaBoost on the LIBSVM instances, where even a few percent improvement is valuable. The major drawback is that the running time with the current implementation is much longer. Nevertheless, the algorithm can be improved in the future by solving the intermediate LPs only approximately and by deriving tailored heuristics that generate decent primal solutions to save time.<br />
<br />
The approach is very well suited to an offline setting in which training may take time and where even a small improvement is beneficial, or where convex boosters exhibit egregious behaviour. It can also serve as a tool to investigate the general performance of methods of this kind.<br />
<br />
The IPBoost algorithm adds extra complexity to basic boosting models for a slight accuracy gain while greatly increasing the time spent: it is 381 times slower than an AdaBoost model on a small dataset, which makes the practical usefulness of this model doubtful. If we were supplied with a larger dataset with millions of records, this model would take too long to complete. The base classifier choice was XGBoost, which is quite complicated for a base classifier; it would be worth trying weaker learners such as tree stumps and comparing the results with the other models. In addition, this model might not be more accurate than model-ensembling techniques in which each model utilizes a different algorithm.<br />
<br />
I think IP Boost can be helpful in cases where a small improvement in accuracy is worth the additional computational effort involved. For example, there are applications in Finance where predicting the movement of stock prices slightly more accurately can help traders make millions more. In such cases, investing in expensive computing systems to implement such boosting techniques can perhaps be justified.<br />
<br />
Although IPBoost can generate better results than base boosting models such as AdaBoost, its time complexity is much higher than that of other models. Hence, it will be a problem to apply IPBoost to real-world training data, since such data sets are large and hard to deal with. We are thus left with the question: IPBoost is only slightly more accurate than the base boosting models but far more expensive in time, so is the model really worth it?<br />
<br />
== Critique ==<br />
<br />
In the introduction section, the sentence "3. Push weight of the data ... will have a higher weight on those data that misclassified." should be changed to "... will have a higher weight on those data that are misclassified."<br />
<br />
The column generation method is used when there are a huge number of variables compared to the number of constraints. If there are only a few variables compared to the number of constraints, using this method will not be very efficient.<br />
<br />
It would be better if the efficiency of different boosting models are tested.<br />
<br />
In practice, AdaBoost and LPBoost are quite robust to overfitting, especially if you use simple weak learners such as stumps (1-level decision trees). However, AdaBoost in particular is prone to overfitting on noisy datasets. To avoid this, use regularized models (AdaBoostReg, LPBoost). Further, AdaBoost can suffer in higher-dimensional spaces, since it is just a linear combination of classifiers. You can use k-fold cross-validation to set the stopping parameter to improve this.<br />
<br />
== References ==<br />
<br />
* Pfetsch, M. E., & Pokutta, S. (2020). IPBoost--Non-Convex Boosting via Integer Programming. arXiv preprint arXiv:2002.04679.<br />
<br />
* Freund, Y., & Schapire, R. E. (1995, March). A desicion-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory (pp. 23-37). Springer, Berlin, Heidelberg.</div>

Neural ODEs
<hr />
<div>== Introduction ==<br />
Chen et al. propose a new class of neural networks called neural ordinary differential equations (ODEs) in their 2018 paper of the same title. Neural network models, such as residual or recurrent networks, can be generalized as a set of transformations through hidden states (a.k.a. layers) <math>\mathbf{h}</math>, given by the equation <br />
<br />
<div style="text-align:center;"><math> \mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t,\theta_t) </math> (1) </div><br />
<br />
where <math>t \in \{0,...,T\}</math> and <math>\theta_t</math> corresponds to the set of parameters or weights in state <math>t</math>. It is important to note that it has been shown (Lu et al., 2017; Haber and Ruthotto, 2017; Ruthotto and Haber, 2018) that Equation 1 can be viewed as an Euler discretization. Given this Euler description, if the number of layers is taken to its limit and the step size between layers to zero, then Equation 1 can instead be described continuously in the form of the ODE, <br />
<br />
<div style="text-align:center;"><math> \frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t),t,\theta) </math> (2). </div><br />
<br />
Equation 2 now describes a network where the output layer <math>\mathbf{h}(T)</math> is generated by solving for the ODE at time <math>T</math>, given the initial value at <math>t=0</math>, where <math>\mathbf{h}(0)</math> is the input layer of the network. <br />
<br />
With a vast amount of theory and research in the field of solving ODEs numerically, there are a number of benefits to formulating the hidden state dynamics this way. One major advantage is that a continuous description of the network allows for the calculation of <math>f</math> at arbitrary intervals and locations. The authors provide an example in section five of how the neural ODE network outperforms the discretized version i.e. residual networks, by taking advantage of the continuity of <math>f</math>. A depiction of this distinction is shown in the figure below. <br />
<br />
<div style="text-align:center;"> [[File:NeuralODEs_Fig1.png|350px]] </div><br />
<br />
In section four the authors show that the single-unit bottleneck of normalizing flows can be overcome by constructing a new class of density models that incorporates the neural ODE network formulation.<br />
The next section on automatic differentiation will describe how utilizing ODE solvers allows for the calculation of gradients of the loss function without storing any of the hidden state information. This results in a very low memory requirement for neural ODE networks in comparison to traditional networks that rely on intermediate hidden state quantities for backpropagation.<br />
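To make the distinction concrete, here is a minimal sketch (with made-up dynamics standing in for a learned <math>f</math>) contrasting the discrete residual update of Equation 1 with handing the same dynamics to a black-box ODE solver as in Equation 2:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Made-up "weights" for a stand-in dynamics function f(h) = tanh(W h).
W = np.array([[0.0, -1.0],
              [1.0,  0.0]])

def f(t, h):
    return np.tanh(W @ h)

h0 = np.array([1.0, 0.0])  # input layer h(0)

# Eq. 1, ResNet-style: T discrete residual steps h_{t+1} = h_t + (1/T) f(h_t)
T = 100
h = h0.copy()
for t in range(T):
    h = h + (1.0 / T) * f(t / T, h)

# Eq. 2, ODE-net style: the output layer h(1) comes from an adaptive solver,
# which can evaluate f at arbitrary times rather than at fixed layers.
sol = solve_ivp(f, (0.0, 1.0), h0, rtol=1e-8, atol=1e-10)
print(h, sol.y[:, -1])  # the Euler iterate approximates the ODE solution
```

As the number of residual steps grows, the discrete iterate converges to the solver's output, which is the sense in which a ResNet is an Euler discretization of an ODE-net.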
<br />
== Reverse-mode Automatic Differentiation of ODE Solutions ==<br />
Like most neural networks, optimizing the weight parameters <math>\theta</math> for a neural ODE network involves finding the gradient of a loss function with respect to those parameters. Differentiating in the forward direction is a simple task, however, this method is very computationally expensive and unstable, as it introduces additional numerical error. Instead, the authors suggest that the gradients can be calculated in the reverse-mode with the adjoint sensitivity method (Pontryagin et al., 1962). This "backpropagation" method solves an augmented version of the forward ODE problem but in reverse, which is something that all ODE solvers are capable of. Section 3 provides results showing that this method gives very desirable memory costs and numerical stability. <br />
<br />
The authors provide an example of the adjoint method by considering the minimization of the scalar-valued loss function <math>L</math>, which takes the solution of the ODE solver as its argument.<br />
<br />
<div style="text-align:center;">[[File:NeuralODEs_Eq1.png|700px]],</div> <br />
This minimization problem requires the calculation of <math>\frac{\partial L}{\partial \mathbf{z}(t_0)}</math> and <math>\frac{\partial L}{\partial \theta}</math>.<br />
<br />
The adjoint itself is defined as <math>\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{z}(t)}</math>, which describes the gradient of the loss with respect to the hidden state <math>\mathbf{z}(t)</math>. By taking the first derivative of the adjoint, another ODE arises in the form of,<br />
<br />
<div style="text-align:center;"><math>\frac{d \mathbf{a}(t)}{dt} = -\mathbf{a}(t)^T \frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \mathbf{z}}</math> (3).</div> <br />
<br />
Since the value <math>\mathbf{a}(t_0)</math> is required to minimize the loss, the ODE in equation 3 must be solved backwards in time from <math>\mathbf{a}(t_1)</math>. Solving this problem is dependent on the knowledge of the hidden state <math>\mathbf{z}(t)</math> for all <math>t</math>, which a neural ODE does not save on the forward pass. Luckily, both <math>\mathbf{a}(t)</math> and <math>\mathbf{z}(t)</math> can be calculated in reverse, at the same time, by setting up an augmented version of the dynamics, as shown in the final algorithm. Finally, the derivative <math>dL/d\theta</math> can be expressed in terms of the adjoint and the hidden state as, <br />
<br />
<div style="text-align:center;"><math> \frac{dL}{d\theta} = -\int_{t_1}^{t_0} \mathbf{a}(t)^T\frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \theta}dt</math> (4).</div><br />
<br />
To obtain very inexpensive calculations of <math>\frac{\partial f}{\partial z}</math> and <math>\frac{\partial f}{\partial \theta}</math> in equation 3 and 4, automatic differentiation can be utilized. The authors present an algorithm to calculate the gradients of <math>L</math> and their dependent quantities with only one call to an ODE solver and is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Algorithm1.png|850px]]</div><br />
<br />
If the loss function has a stronger dependence on the hidden states for <math>t \neq t_0,t_1</math>, then Algorithm 1 can be modified to handle multiple calls to the ODESolve step since most ODE solvers have the capability to provide <math>z(t)</math> at arbitrary times. A visual depiction of this scenario is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODES Fig2.png|350px]]</div><br />
<br />
Please see the [https://arxiv.org/pdf/1806.07366.pdf#page=13 appendix] for extended versions of Algorithm 1 and detailed derivations of each equation in this section.<br />
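As a sanity check of the adjoint equations, the following sketch applies the idea of Algorithm 1 to a one-dimensional toy problem where the gradient is known in closed form. The dynamics, loss, and all constants here are made up; a real ODE-net would parameterize <math>f</math> with a neural network:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy setup: dz/dt = f(z, theta) = -theta * z,  loss L = 0.5 * z(t1)^2.
theta, z0, t0, t1 = 0.5, 2.0, 0.0, 1.0

def f(z, th):
    return -th * z

# Forward pass: one call to the ODE solver gives z(t1).
fwd = solve_ivp(lambda t, z: f(z, theta), (t0, t1), [z0],
                rtol=1e-10, atol=1e-12)
z1 = fwd.y[0, -1]

# Backward pass: integrate the augmented state [z, a, g] from t1 back to t0:
#   dz/dt = f(z, theta)
#   da/dt = -a * df/dz       (the adjoint ODE, Equation 3)
#   dg/dt = -a * df/dtheta   (accumulates dL/dtheta, Equation 4)
def augmented(t, s):
    z, a, g = s
    dfdz, dfdtheta = -theta, -z
    return [f(z, theta), -a * dfdz, -a * dfdtheta]

a1 = z1  # a(t1) = dL/dz(t1) for L = 0.5 z^2
bwd = solve_ivp(augmented, (t1, t0), [z1, a1, 0.0], rtol=1e-10, atol=1e-12)
dL_dtheta = bwd.y[2, -1]

# Closed form: z(t1) = z0 exp(-theta (t1 - t0)), so dL/dtheta = -(t1 - t0) z(t1)^2.
print(dL_dtheta, -(t1 - t0) * z1 ** 2)
```

Note that the backward solve reconstructs <math>z(t)</math> alongside the adjoint, so no intermediate hidden states from the forward pass need to be stored.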
<br />
== Replacing Residual Networks with ODEs for Supervised Learning ==<br />
Section three of the paper investigates an application of the reverse-mode differentiation described in section two to the training of neural ODE networks on the MNIST digit data set. To solve for the forward pass in the neural ODE network, the following experiment used the Adams-Moulton (AM) method, which is an implicit ODE solver. Although it is a marked improvement over explicit ODE solvers in numerical accuracy, integrating backward through the network for backpropagation is still not preferred, and the adjoint sensitivity method is used instead to perform efficient weight optimization. The network with this "backpropagation" technique is referred to as ODE-Net in this section. <br />
<br />
=== Implementation ===<br />
A residual network (ResNet), studied by He et al. (2016), with six standard residual blocks was used as a comparative model for this experiment. The competing model, ODE-net, replaces the residual blocks of the ResNet with the AM solver. As a hybrid of the two models ResNet and ODE-net, a third network was created called RK-Net, which solves the weight optimization of the neural ODE network explicitly through backward Runge-Kutta integration. The following table shows the training and performance results of each network. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Table1.png|400px]]</div><br />
<br />
Note that <math>L</math> and <math>\tilde{L}</math> are the number of layers in ResNet and the number of function calls that the AM method makes for the two ODE networks and are effectively analogous quantities. As shown in Table 1, both of the ODE networks achieve comparable performance to that of the ResNet with a notable decrease in memory cost for ODE-net.<br />
<br />
<br />
Another interesting component of ODE networks is the ability to control the tolerance in the ODE solver used and subsequently the numerical error in the solution. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Fig3.png|700px]]</div><br />
<br />
The tolerance of the ODE solver is represented by the color bar in Figure 3 above and notice that a variety of effects arise from adjusting this parameter. Primarily, if one was to treat the tolerance as a hyperparameter of sorts, you could tune it such that you find a balance between accuracy (Figure 3a) and computational complexity (Figure 3b). Figure 3c also provides further evidence for the benefits of the adjoint method for the backward pass in ODE-nets since there is a nearly 1:0.5 ratio of forward to backward function calls. In the ResNet and RK-Net examples, this ratio is 1:1.<br />
<br />
Additionally, the authors loosely define the concept of depth in a neural ODE network by referring to Figure 3d. Here it's evident that as you continue to train an ODE network, the number of function evaluations the ODE solver performs increases. As previously mentioned, this quantity is comparable to the network depth of a discretized network. However, as the authors note, this result should be seen as the progression of the network's complexity over training epochs, which is something we expect to increase over time.<br />
<br />
== Continuous Normalizing Flows ==<br />
<br />
Section four tackles the implementation of continuous-depth Neural Networks, but to do so, in the first part of section four the authors discuss theoretically how to establish this kind of network through the use of normalizing flows. The authors use a change of variables method presented in other works (Rezende and Mohamed, 2015), (Dinh et al., 2014), to compute the change of a probability distribution if sample points are transformed through a bijective function, <math>f</math>.<br />
<br />
<div style="text-align:center;"><math>z_1=f(z_0) \Rightarrow \log(p(z_1))=\log(p(z_0))-\log|\det\frac{\partial f}{\partial z_0}|</math></div><br />
<br />
Where p(z) is the probability distribution of the samples and <math>det\frac{\partial f}{\partial z_0}</math> is the determinant of the Jacobian which has a cubic cost in the dimension of '''z''' or the number of hidden units in the network. The authors discovered however that transforming the discrete set of hidden layers in the normalizing flow network to continuous transformations simplifies the computations significantly, due primarily to the following theorem:<br />
<br />
'''''Theorem 1:''' (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability p(z(t)) dependent on time. Let dz/dt=f(z(t),t) be a differential equation describing a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the change in log probability also follows a differential equation:''<br />
<br />
<div style="text-align:center;"><math>\frac{\partial \log(p(z(t)))}{\partial t}=-tr\left(\frac{df}{dz(t)}\right)</math></div><br />
<br />
The biggest advantage to using this theorem is that the trace function is a linear function, so if the dynamics of the problem, f, is represented by a sum of functions, then so is the log density. This essentially means that you can now compute flow models with only a linear cost with respect to the number of hidden units, <math>M</math>. In standard normalizing flow models, the cost is <math>O(M^3)</math>, so they will generally fit many layers with a single hidden unit in each layer.<br />
<br />
Finally the authors use these realizations to construct Continuous Normalizing Flow networks (CNFs) by specifying the parameters of the flow as a function of ''t'', ie, <math>f(z(t),t)</math>. They also use a gating mechanism for each hidden unit, <math>\frac{dz}{dt}=\sum_n \sigma_n(t)f_n(z)</math> where <math>\sigma_n(t)\in (0,1)</math> is a separate neural network which learns when to apply each dynamic <math>f_n</math>.<br />
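For linear dynamics, the consequence of Theorem 1 can be checked directly: if <math>dz/dt = Az</math>, the flow over time <math>t</math> is multiplication by <math>e^{At}</math>, and the log-determinant of that Jacobian equals <math>t\,\mathrm{tr}(A)</math>, exactly as integrating the trace predicts. A quick numerical sanity check with an arbitrary random <math>A</math> (not from the paper):

```python
import numpy as np
from scipy.linalg import expm

# For dz/dt = A z, the flow map over time t is z -> expm(A t) z, whose
# Jacobian is expm(A t).  Integrating Theorem 1 gives
#   log p(z(t)) = log p(z(0)) - t * tr(A),
# so log|det Jacobian| should equal t * tr(A).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
t = 0.7

J = expm(A * t)
log_det = np.log(abs(np.linalg.det(J)))
print(log_det, t * np.trace(A))  # the two quantities agree
```

This linear (trace) dependence on the dynamics is what drops the cost from the cubic determinant of discrete normalizing flows to <math>O(M)</math> per step.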
<br />
===Implementation===<br />
<br />
The authors construct two separate types of neural networks to compare against each other: the first is the standard planar normalizing flow network (NF) using 64 layers of single hidden units, and the second is their new CNF with 64 hidden units. The NF model is trained over 500,000 iterations using RMSprop, and the CNF network is trained over 10,000 iterations using Adam (an algorithm for first-order gradient-based optimization of stochastic objective functions). The loss function is <math>KL(q(x)||p(x))</math> where <math>q(x)</math> is the flow model and <math>p(x)</math> is the target probability density.<br />
<br />
One of the biggest advantages when implementing CNF is that you can train the flow parameters just by performing maximum likelihood estimation on <math>\log(q(x))</math> given <math>p(x)</math>, where <math>q(x)</math> is found via the theorem above, and then reversing the CNF to generate random samples from <math>q(x)</math>. This reversal of the CNF can be done at about the same cost as the forward pass, which is not possible in an NF network. The following two figures demonstrate the ability of CNF to generate more expressive and accurate output data compared to standard NF networks.<br />
<br />
<div style="text-align:center;"><br />
[[Image:CNFcomparisons.png]]<br />
<br />
[[Image:CNFtransitions.png]]<br />
</div><br />
<br />
Figure 4 shows clearly that the CNF structure exhibits significantly lower loss functions than NF. In figure 5 both networks were tasked with transforming a standard Gaussian distribution into a target distribution, not only was the CNF network more accurate on the two moons target, but also the steps it took along the way are much more intuitive than the output from NF.<br />
<br />
== A Generative Latent Function Time-Series Model ==<br />
<br />
One of the largest issues for Neural ODE networks in time-series settings is that data points are often sparsely distributed or irregularly sampled. In standard models, the latent dynamics are discretized and the observations are placed into bins of fixed duration, which creates issues with missing data and ill-defined latent variables. An example of this is medical records, which are only updated when a patient visits a doctor or the hospital. To solve this issue the authors create a generative time-series model which is able to fill in the gaps of missing data. The authors consider each time series as a latent trajectory stemming from the initial local state <math>z_{t_0 }</math> and determined from a global set of latent parameters. Given a set of observation times and an initial state, the generative model constructs points via the following sampling procedure:<br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_0}∼p(z_{t_0}) <br />
</math><br />
</div> <br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_1},z_{t_2},\dots,z_{t_N}={\rm ODESolve}(z_{t_0},f,θ_f,t_0,...,t_N)<br />
</math><br />
</div><br />
<br />
<div style="text-align:center;"><br />
each <br />
<math><br />
x_{t_i}∼p(x│z_{t_i},θ_x)<br />
</math><br />
</div><br />
<br />
<math>f</math> is a function which outputs the gradient <math>\frac{\partial z(t)}{\partial t}=f(z(t),θ_f)</math> which is parameterized via a neural net. In order to train this latent variable model, the authors had to first encode their given data and observation times using an RNN encoder, construct the new points using the trained parameters, then decode the points back into the original space. The following figure describes this process:<br />
<br />
<div style="text-align:center;"><br />
[[Image:EncodingFigure.png]]<br />
</div><br />
<br />
Another variable which could affect the latent state of a time-series model is how often an event actually occurs. The authors solved this by parameterizing the rate of events in terms of a Poisson process. They described the set of independent observation times in an interval <math>\left[t_{start},t_{end}\right]</math> as:<br />
<br />
<div style="text-align:center;"> <br />
<math><br />
{\rm log}(p(t_1,t_2,\dots,t_N ))=\sum_{i=1}^N{\rm log}(\lambda(z(t_i)))-\int_{t_{start}}^{t_{end}}λ(z(t))dt<br />
</math><br />
</div><br />
<br />
where <math>\lambda(*)</math> is parameterized via another neural network.<br />
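The sampling procedure above can be sketched as follows; the dynamics, decoder, and all constants here are made-up stand-ins for the trained networks:

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)

# Made-up stand-in for the learned dynamics f(z, theta_f).
W = 0.5 * rng.normal(size=(2, 2))
def f(t, z):
    return np.tanh(W @ z)

# Arbitrary (possibly irregular) observation times t_1, ..., t_N.
t_obs = np.sort(rng.uniform(0.0, 5.0, size=20))

# z_{t0} ~ p(z_{t0}): here a standard Gaussian prior.
z0 = rng.normal(size=2)

# z_{t1}, ..., z_{tN} = ODESolve(z_{t0}, f, t_0, ..., t_N)
sol = solve_ivp(f, (0.0, t_obs[-1]), z0, t_eval=t_obs)
latents = sol.y.T

# Each x_{ti} ~ p(x | z_{ti}, theta_x): here a Gaussian decoder (identity
# mean, fixed noise scale) standing in for the trained decoder network.
x = latents + 0.1 * rng.normal(size=latents.shape)
print(x.shape)  # one 2-d observation per sampled time
```

Because the latent trajectory is defined continuously, the same solved ODE can be evaluated at any set of observation times, which is what lets the model handle irregular sampling.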
<br />
===Implementation===<br />
<br />
To test the effectiveness of the Latent time-series ODE model (LODE), they fit the encoder with 25 hidden units, parametrize function f with a one-layer 20 hidden unit network, and the decoder as another neural network with 20 hidden units. They compare this against a standard recurrent neural net (RNN) with 25 hidden units trained to minimize Gaussian log-likelihood. The authors tested both of these network systems on a dataset of 2-dimensional spirals which either rotated clockwise or counter-clockwise and sampled the positions of each spiral at 100 equally spaced time steps. They can then simulate irregularly timed data by taking random amounts of points without replacement from each spiral. The next two figures show the outcome of these experiments:<br />
<br />
<div style="text-align:center;"><br />
[[Image:LODEtestresults.png]] [[Image:SpiralFigure.png|The blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model]]<br />
</div><br />
<br />
In the figure on the right the blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model. It is noted that the LODE performs significantly better than the standard RNN model, especially on smaller sets of data points.<br />
<br />
== Scope and Limitations ==<br />
<br />
This part mainly discusses the scope and limitations of the paper. Firstly, while "batching" the training data is a useful step in standard neural nets and can still be applied here by combining the ODEs associated with each batch, the authors found that controlling the error, in this case, may increase the number of calculations required. In practice, however, the number of calculations did not increase significantly.<br />
<br />
So long as the model proposed in this paper uses finite weights and Lipschitz nonlinearities, Picard's existence and uniqueness theorem (Coddington and Levinson, 1955) applies, which guarantees that the solution to the initial value problem exists and is unique.<br />
<br />
In controlling the error amount in the model, the authors could only reduce tolerances to approximately <math>10^{-3}</math> and <math>10^{-5}</math> in classification and density estimation, respectively, without also degrading the computational performance.<br />
<br />
The authors believe that reconstructing state trajectories by running the dynamics backward can introduce extra numerical error. They address a possible solution to this problem by checkpointing specific time steps and storing intermediate values of z on the forward pass. Then while reconstructing, it does each part individually between checkpoints. The authors acknowledged that they informally checked this method's validity since they do not consider it a practical problem.<br />
<br />
There remain, however, areas where standard neural networks may perform better than Neural ODEs. Firstly, conventional nets can fit non-homeomorphic functions, such as functions whose output has a smaller dimension than their input or that change the input space's topology. However, this could be handled by composing ODE nets with standard network layers. Another point is that conventional nets can be evaluated precisely with a fixed amount of computation, are typically faster to train, and do not require an error tolerance for a solver.<br />
<br />
== Conclusions and Critiques ==<br />
<br />
We covered the use of black-box ODE solvers as a model component and their application to initial value problems constructed from real applications. Neural ODE Networks show promising gains in computational cost without large sacrifices in accuracy when applied to certain problems. A drawback of some of these implementations is that the ODE Neural Networks are limited by the underlying distributions of the problems they are trying to solve (requirement of Lipschitz continuity, etc.). There are plenty of further advances to be made in this field as hundreds of years of ODE theory and literature is available, so this is currently an important area of research.<br />
<br />
ODEs indeed represent an important area of applied mathematics where neural networks can be used to solve them numerically. Perhaps, a parallel area of investigation can be PDEs (Partial Differential Equations). PDEs are also widely encountered in many areas of applied mathematics, physics, social sciences, and many other fields. It will be interesting to see how neural networks can be used to solve PDEs.<br />
<br />
== More Critiques ==<br />
Table 1 shows a comparison between different implementations which is very helpful. We can see from the table that the 1-Layer MLP has the largest test error and the one with the best performance should be ODE-Net. Although it doesn't have the lowest test error (the test error of ODE-Net is 0.42% and the lowest test error is 0.41% for ResNet), it still has the least number of parameters, memory, and time. This convinced us that it can be widely used in other applications. <br />
<br />
For the last paragraph in the scope and limitations section, I guess the author wants to use the word "than" instead of using "that" in the sentence "for example, functions whose output has a smaller dimension that their input, or that change the topology of the input space."<br />
<br />
This paper covers the memory efficiency of Neural ODE Networks but does not address runtime. In practice, most systems are bound by latency requirements more so than memory requirements (except in edge-device cases). Though it may be unreasonable to expect the authors to produce a performance-optimized implementation, it would be insightful to understand the computational bottlenecks so that existing frameworks can take steps to address them. This model looks promising, and practical performance is key to enabling future research in this area.<br />
<br />
The above critique also questions the need for a neural network for such a problem. This problem was studied by Brunel et al., who presented their solution in the paper ''Parametric Estimation of Ordinary Differential Equations with Orthogonality Conditions''. While their solution also requires iteratively solving a complex optimization problem, it does not incur the massive memory and runtime overhead of a neural network. For the neural-network solution to demonstrate its potential, it should include experimental comparisons with specialized ordinary differential equation algorithms instead of simply comparing with a general recurrent neural network.<br />
<br />
Table 2 shows that Latent ODEs have lower predictive RMSE, and more relevant information should be provided. For example, the reason for setting n to 30/50/100 could be briefly explained. It would also be worth discussing the performance of the Latent ODE further, since changing n has little impact on its RMSE.<br />
<br />
== References ==<br />
Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. ''arXiv preprint arXiv:1710.10121'', 2017.<br />
<br />
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. ''Inverse Problems'', 34 (1):014004, 2017.<br />
<br />
Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. ''arXiv preprint arXiv:1804.04272'', 2018.<br />
<br />
Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. ''The mathematical theory of optimal processes''. 1962.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ''European conference on computer vision'', pages 630–645. Springer, 2016b.<br />
<br />
Earl A Coddington and Norman Levinson. ''Theory of ordinary differential equations''. Tata McGraw-Hill Education, 1955.<br />
<br />
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. ''arXiv preprint arXiv:1505.05770'', 2015.<br />
<br />
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. ''arXiv preprint arXiv:1410.8516'', 2014.<br />
<br />
Brunel, N. J., Clairon, Q., & d’Alché-Buc, F. (2014). Parametric estimation of ordinary differential equations with orthogonality conditions. ''Journal of the American Statistical Association'', 109(505), 173-185.<br />
<br />
A. Iserles. ''A first course in the numerical analysis of differential equations''. Cambridge Texts in Applied Mathematics, second edition, 2009.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_universal_SNP_and_small-indel_variant_caller_using_deep_neural_networks&diff=49107A universal SNP and small-indel variant caller using deep neural networks2020-12-04T06:24:31Z<p>Z42qin: /* Critique and Discussion */</p>
<hr />
<div>== Background ==<br />
<br />
Genes determine biological functions, and mutants or alleles (one of two or more alternative forms of a gene that arise by mutation and are found at the same place on a chromosome) of those genes determine differences within a function. Determining novel alleles is very important in understanding the genetic variation within a species. For example, different alleles of the gene OCA2 determine eye color. All animals receive one copy of each gene from each of their parents. Mutations of a gene are classified as either homozygous (both copies are the same) or heterozygous (the two copies are different).<br />
<br />
Next-generation sequencing (NGS) is a prevalent technique for sequencing, or reading, DNA. Since all genes are encoded as DNA, sequencing is an essential tool for understanding genes. Next-generation sequencing works by reading short sections of DNA of length k, called k-mers, and then piecing them together or aligning them to a reference genome. Next-generation sequencing is relatively fast and inexpensive, although it can randomly misidentify some nucleotides, introducing errors. These errors arise from a complex process that depends on various factors.<br />
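As an aside, extracting the overlapping length-k pieces (k-mers) of a read is a simple sliding window. The sketch below is purely illustrative and not part of any sequencing pipeline described in the paper:<br />

```python
def kmers(read: str, k: int) -> list:
    """Return all overlapping substrings of length k (the k-mers) of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

# A read of length 7 yields 7 - 3 + 1 = 5 overlapping 3-mers.
print(kmers("GATTACA", 3))  # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
```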
<br />
The process of variant calling is determining novel alleles from sequencing data (typically next-generation sequencing data). Some significant alleles differ from the "standard" version of a gene by only a single base pair, such as the mutation associated with multiple sclerosis. Therefore it is crucial to accurately call single-nucleotide polymorphisms (SNPs) as well as insertions and deletions (indels). Calling SNPs and small indels is technically challenging, since it requires a program to distinguish between genuinely novel mutations and errors in the sequencing data.<br />
<br />
Previous approaches usually involved various statistical techniques. A widely used one is GATK, which combines logistic regression, hidden Markov models, naive Bayes classification, and Gaussian mixture models [2]. However, these methods have their weaknesses, as some of their assumptions do not hold (e.g., independence assumptions), and it is hard to generalize them to other sequencing technologies.<br />
<br />
This paper aims to solve the problem of calling SNPs and small indels using a convolutional neural network, by casting the reads as images and classifying whether they contain a mutation. It introduces a variant caller called "DeepVariant", which requires no specialized knowledge but performs better than previous state-of-the-art methods.<br />
<br />
== Overview ==<br />
<br />
In Figure 1, the DeepVariant workflow overview is illustrated.<br />
<br />
[[File:figure 111.JPG|Figure 1. In all panels, blue boxes represent data and red boxes are processes]]<br />
<br />
<br />
Initially, the NGS reads aligned to a reference genome are scanned for candidate variants, i.e., sites that differ from the reference genome. The read and reference data are encoded as an image for each candidate variant site. A trained CNN can then compute the genotype likelihoods (heterozygous or homozygous) for each of the candidate variants (Figure 1, left box). <br />
To train the CNN for image classification, the DeepVariant machinery makes pileup images for a labeled sample with known genotypes. These labeled images and known genotypes are provided to the CNN for training, and a stochastic gradient descent algorithm is used to optimize the CNN parameters to maximize genotype prediction accuracy. After the model converges, it is frozen and used for calling mutations on new data (Figure 1, middle box).<br />
For example, in Figure 1 (right box), the reference and read bases are encoded into a pileup image at a candidate variant site. Using this encoded image, the CNN computes the genotype likelihoods for the three diploid genotype states: homozygous reference (hom-ref), heterozygous (het), and homozygous alternate (hom-alt). In this example, a heterozygous variant call is emitted, as the most probable genotype is "het".<br />
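The final step in this example amounts to emitting the genotype state with the highest likelihood. A minimal illustrative sketch (the probability values below are hypothetical):<br />

```python
def call_genotype(likelihoods: dict) -> str:
    """Emit the diploid genotype state with the highest assigned likelihood."""
    return max(likelihoods, key=likelihoods.get)

# As in the Figure 1 example, 'het' is most probable, so a heterozygous call is emitted.
print(call_genotype({"hom-ref": 0.05, "het": 0.85, "hom-alt": 0.10}))  # het
```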
<br />
== Preprocessing ==<br />
<br />
Before the sequencing reads can be fed into the classifier, they must be pre-processed. Several pre-processing steps are necessary for this algorithm; these steps represent the real novelty of the technique, transforming the data so that common neural network architectures can be used for classification. The pre-processing can be broken into three phases: realigning reads, finding candidate variants, and creating images of the candidate variants.<br />
<br />
The read-realignment phase of pre-processing is essential to ensure the sequences can be adequately compared to the reference. First, the sequences are aligned to a reference sequence. Reads that align poorly are grouped with other reads around them to build that section, or haplotype, from scratch. If there is strong evidence that the new version of the haplotype fits the reads well, the reads are re-aligned to it. This process updates each read's CIGAR (Compact Idiosyncratic Gapped Alignment Report) string, which represents the sequence's alignment to the reference.<br />
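A CIGAR string compactly lists run-lengths of alignment operations (e.g., M for match/mismatch, I for insertion, D for deletion). Decoding one can be sketched as follows; this is an illustrative helper, not the paper's implementation:<br />

```python
import re

def decode_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs.

    Operation codes (M, I, D, N, S, H, P, =, X) follow the SAM specification.
    """
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

# 5 aligned bases, then a 1-base insertion, then 4 more aligned bases:
print(decode_cigar("5M1I4M"))  # [(5, 'M'), (1, 'I'), (4, 'M')]
```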
<br />
Once the reads are correctly aligned, the algorithm then proceeds to find candidate variants, regions in the DNA sequence containing variants. It is these candidate variants that will eventually be passed as input to the neural network. To find these, we need to consider each position in the reference sequence independently. Any unusable reads are filtered at this point. This includes reads that are not appropriately aligned, marked as duplicates, those that fail vendor quality checks, or whose mapping quality is less than ten. For each site in the genome, we collect all the remaining reads that overlap that site. The corresponding allele aligned to that site is then determined by decoding the CIGAR string, which was updated in each read's realignment phase. The alleles are then classified into one of four categories: reference-matching base, reference-mismatching base, insertion with a specific sequence, or deletion with a specific length, and the number of occurrences of each distinct allele across all reads is counted. Read bases are only included as potential alleles if each base in the allele has a quality score of at least 10.<br />
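The filtering and counting logic described above can be sketched roughly as follows. The read-record fields and the example reads are simplified stand-ins for real aligned-read data; only the thresholds (mapping quality and per-base quality of at least 10) come from the text:<br />

```python
from collections import Counter

MIN_MAPPING_QUALITY = 10   # reads below this are filtered out
MIN_BASE_QUALITY = 10      # alleles containing any base below this are ignored

def count_alleles(reads) -> Counter:
    """Count candidate alleles at one site from the reads overlapping it.

    Each read is a dict with 'allele' (decoded from its CIGAR string),
    'base_quals' (per-base quality scores), and 'mapq' (mapping quality) --
    a simplified stand-in for a real aligned-read record.
    """
    counts = Counter()
    for read in reads:
        if read["mapq"] < MIN_MAPPING_QUALITY:
            continue  # unusable read: mapping quality too low
        if any(q < MIN_BASE_QUALITY for q in read["base_quals"]):
            continue  # allele contains a low-confidence base
        counts[read["allele"]] += 1
    return counts

reads = [
    {"allele": "A", "base_quals": [30], "mapq": 60},
    {"allele": "G", "base_quals": [25], "mapq": 60},
    {"allele": "G", "base_quals": [5], "mapq": 60},   # dropped: low base quality
    {"allele": "G", "base_quals": [30], "mapq": 3},   # dropped: low mapping quality
]
print(count_alleles(reads))  # Counter({'A': 1, 'G': 1})
```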
<br />
The last phase of pre-processing converts these candidate variants into images representing the data with candidate variants identified. This allows well-established convolutional neural networks for image classification to be applied to this technical problem. Each color channel stores a different piece of information about a candidate variant. The red channel encodes which base is present (A, G, C, or T) by mapping each base to a particular value. The quality of the read is mapped to the green channel.<br />
<br />
Moreover, the blue channel encodes whether or not the reference is on the positive strand of the DNA. Each row of the image represents a read, and each column represents a particular base in that read. The reference strand is repeated for the first five rows of the encoded image to maintain its information after a 5x5 convolution is applied. With the data pre-processing complete, the images can then be passed into the neural network for classification.<br />
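The three-channel pixel encoding can be sketched as follows. The specific base-to-red values below are hypothetical; the paper's actual mappings are not evenly spaced and are not reproduced here:<br />

```python
# Hypothetical channel values -- illustrative only, not the paper's mapping.
BASE_TO_RED = {"A": 250, "C": 30, "G": 180, "T": 100}
MAX_QUALITY = 40

def encode_pixel(base: str, quality: int, on_positive_strand: bool) -> tuple:
    """Encode one read base as an (R, G, B) pixel.

    Red: which base (A/C/G/T); green: base quality scaled to 0-255;
    blue: whether the read maps to the positive strand of the DNA.
    """
    red = BASE_TO_RED[base]
    green = int(255 * min(quality, MAX_QUALITY) / MAX_QUALITY)
    blue = 255 if on_positive_strand else 0
    return (red, green, blue)

print(encode_pixel("A", 40, True))   # (250, 255, 255)
print(encode_pixel("T", 20, False))  # (100, 127, 0)
```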
<br />
== Neural Network ==<br />
<br />
The neural network used is a convolutional neural network. Although the full network architecture is not revealed in the paper, there are several details which we can discuss. The architecture is an input layer attached to an adapted Inception v2 ImageNet model with nine partitions. The Inception v2 model uses a series of CNNs; one interesting aspect of the Inception design is that, rather than tuning hyperparameters to find the single best filter configuration, it concatenates filters of several different sizes on the same layer and in effect learns the best architecture from among them. The input layer takes the images representing the candidate variants and rescales them to 299x299 pixels.<br />
<br />
The output layer is a three-class softmax layer, initialized with Gaussian random weights with a standard deviation of 0.001 and fully connected to the previous layer. The three classes are homozygous reference (meaning it is not a variant), heterozygous variant, and homozygous variant. The candidate variant is classified into the class with the highest probability.<br />
<br />
The model is trained using stochastic gradient descent with a weight decay of 0.00004. Training was done in mini-batches of 32 images, using a root mean squared (RMS) decay of 0.9. For the multiple-sequencing-technologies experiments, a single model was trained with a learning rate of 0.0015 and momentum 0.8 for 250,000 update steps. For all other experiments, multiple models were trained, and the one with the highest accuracy on the training set was chosen as the final model. The multiple models stem from using each combination of the possible learning rates (0.00095, 0.001, 0.0015) and momentums (0.8, 0.85, 0.9). These models were trained for 80 hours, or until the training accuracy converged.<br />
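The reported optimizer settings correspond to standard stochastic gradient descent with momentum and L2 weight decay. The following is a minimal illustrative update step in plain Python, not the actual framework-based training loop; the defaults mirror the multi-technology experiment (learning rate 0.0015, momentum 0.8, weight decay 0.00004):<br />

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.0015, momentum=0.8, weight_decay=0.00004):
    """One SGD update with momentum and L2 weight decay on flat parameter lists."""
    new_w, new_v = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # weight decay pulls parameters toward zero
        v = momentum * v - lr * g  # accumulate a velocity term
        new_w.append(w + v)
        new_v.append(v)
    return new_w, new_v

# A few steps on the toy loss w^2 (gradient 2w): the weight shrinks toward 0.
w, v = [1.0], [0.0]
for _ in range(3):
    w, v = sgd_momentum_step(w, [2 * wi for wi in w], v)
print(w)
```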
<br />
== Results ==<br />
<br />
DeepVariant was trained using data available from the CEPH (Centre d'Etude du Polymorphisme Humain) female sample NA12878 and was evaluated on the unseen Ashkenazi male sample NA24385. The results were compared with the most commonly used bioinformatics methods: GATK, FreeBayes, SAMtools, 16GT, and Strelka (Table 1). For comparison, the overall accuracy (F1), recall, precision, and the numbers of true positives (TP), false negatives (FN), and false positives (FP) are reported over the whole genome.<br />
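For reference, the reported precision, recall, and F1 are derived from the TP/FP/FN counts in the usual way; the sketch below uses toy counts, not the paper's numbers:<br />

```python
def metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 score from variant-calling counts."""
    precision = tp / (tp + fp)  # fraction of emitted calls that are correct
    recall = tp / (tp + fn)     # fraction of true variants that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy counts: 90 true calls, 5 spurious calls, 10 missed variants.
p, r, f1 = metrics(tp=90, fp=5, fn=10)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.9474 0.9 0.9231
```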
<br />
[[File:table 11.JPG]]<br />
<br />
DeepVariant showed the highest accuracy and more than 50% fewer errors per genome compared to the next best algorithm. <br />
<br />
They also evaluated the same set of algorithms using the synthetic diploid sample CHM1-CHM13 (Table 2).<br />
<br />
[[File:Table 333.JPG]]<br />
<br />
Results illustrated that DeepVariant outperformed all other algorithms for variant calling (SNPs and indels), showing the highest F1, recall, precision, and TP counts.<br />
<br />
== Conclusion ==<br />
<br />
This endeavor to further advance a data-centric approach to understanding the gene sequence illustrates the advantages of deep learning. With billions of DNA base pairs, no human can digest that amount of genetic information. In the past, such computational techniques were unfeasible due to the lack of computing power, but in the 21st century, it seems that machine learning is the way to go for molecular biology.<br />
<br />
DeepVariant’s strong performance on human data shows that deep learning is a promising technique for variant calling. Perhaps the most exciting feature of DeepVariant is its simplicity. Unlike other state-of-the-art variant callers, DeepVariant knows nothing about the sequencing technologies that create the reads or even the biological processes that introduce mutations. It reduces the problem of variant calling to preprocessing the reads and training a generic deep learning model. This also suggests that DeepVariant could be significantly improved by tailoring the preprocessing to specific sequencing technologies and developing a dedicated CNN architecture for the reads, rather than casting them as images.<br />
<br />
== Critique and Discussion==<br />
<br />
The paper presents an attractive method for solving a significant problem. Building "images" of reads and running them through a generic image-classification CNN seems like a strange approach, and, interestingly, it works well. The most significant issue with the paper is the lack of specific information about how the methods work. Some extra information is included in the supplementary material, but there are still some significant gaps. In particular:<br />
<br />
1. What is the structure of the neural net? How many layers, and what sizes? The cited ConvNet paper does not have this information. We suspect that this might be a trade secret that Google is protecting.<br />
<br />
2. How is the realignment step implemented? The paper mentions that it uses a "De-Bruijn-graph-based read assembly procedure" to realign reads to a new haplotype. It is a non-standard step in most genomics workflows, yet the paper does not describe how they do the realignment or build the haplotypes.<br />
<br />
3. How did they settle on the image construction algorithm? The authors provide pseudocode for the construction of pileup images, but they do not describe how these design decisions were made. For instance, the color values for different base pairs are not evenly spaced, and the image begins with five rows of the reference genome.<br />
<br />
One thing we appreciated about the paper was their commentary on future developments. The authors clarify that this approach can be improved on and provide specific ideas for the next steps.<br />
<br />
Overall, the paper presents an interesting idea with strong results but lacks detail in some vital implementation pieces.<br />
<br />
The topic of this project is good, but more details on the algorithm are needed. In the neural network part, the details are not sufficient; the authors should provide a figure to better explain how the model works and what its structure is. Otherwise, it is hard to understand how the model works. Also, when preprocessing the data, if different reads have different lengths, should information be added (padding) or dropped (truncation) so that they match?<br />
<br />
5. In particular, which package did the researchers use to perform this analysis? Different deep learning packages can differ in accuracy and efficiency when making predictions on this data set. <br />
<br />
Further studies on DeepVariant [https://www.nature.com/articles/s41598-018-36177-7 have shown] that it is a framework with great potential and sets the standard in the medical genetics field.<br />
<br />
6. It was mentioned that part of the network used is an "adapted Inception v2 ImageNet model" - does this mean that it used an Inception v2 model that was pretrained on ImageNet? This is not clear, but if so, why is this useful? Why would features extracted for natural images be useful for genomics? Did they try using a model that was not pretrained? Also, they describe the preprocessing method used, but were there any alternatives that they considered?<br />
<br />
7. A more extensive discussion of the "Neural Network" section could be given. For example, the paragraph's last sentence says that "Models are trained for 80 hours, or until the training accuracy converged." This implies that when the training accuracy does converge, it usually takes less than 80 hours. More data about convergence time would thus be interesting. How often do the models converge? How long does convergence take on average? Moreover, why was the number 80 chosen? Is it an arbitrary upper bound on the training runtime, or was it carefully chosen as a multiple of the average convergence time?<br />
<br />
== References ==<br />
[1] Hartwell, L.H. ''et al.'' ''Genetics: From Genes to Genomes''. (McGraw-Hill Ryerson, 2014).<br />
<br />
[2] Poplin, R. ''et al.'' A universal SNP and small-indel variant caller using deep neural networks. ''Nature Biotechnology'' '''36''', 983-987 (2018).</div>
<br />
7. A more extensive discussion on the "Neural Network" section can be give. For example, the last sentence of the paragraph says that "Models are trained for 80 hours, or until the training accuracy converged." This sentence implies that if the training accuracy does converge, it usually takes less than 80 hours. More interesting data can thus be presented about the training accuracy converging time. How often do the models converge? How long does it take for the models to converge on average? Moreover, why is the number 80 chosen here? Is it a random upper bound set by people to bound the runtime of the model training, or is the number 80 carefully chosen so that it's twice/three times/ten times of the average training accuracy converging time? I believe that these are interesting facts to look at.<br />
<br />
== References ==<br />
[1] Hartwell, L.H. ''et. al.'' ''Genetics: From Genes to Genomes''. (McGraw-Hill Ryerson, 2014).<br />
<br />
[2] Poplin, R. ''et. al''. A universal SNP and small-indel variant caller using deep neural networks. ''Nature Biotechnology'' '''36''', 983-987 (2018).</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_universal_SNP_and_small-indel_variant_caller_using_deep_neural_networks&diff=49105A universal SNP and small-indel variant caller using deep neural networks2020-12-04T06:18:15Z<p>Z42qin: /* Preprocessing */</p>
<hr />
<div>== Background ==<br />
<br />
Genes determine biological functions, and mutants, or alleles (alternative forms of a gene that arise by mutation and are found at the same place on a chromosome), of those genes determine differences within a function. Identifying novel alleles is very important in understanding the genetic variation within a species. For example, different alleles of the gene OCA2 determine eye colour. All animals receive one copy of each gene from each of their parents, and a mutation of a gene is classified as either homozygous (both copies are the same) or heterozygous (the two copies are different).<br />
<br />
Next-generation sequencing (NGS) is a prevalent technique for sequencing, or reading, DNA. Since all genes are encoded as DNA, sequencing is an essential tool for understanding genes. NGS works by reading short sections of DNA of length k, called k-mers, and then piecing them together or aligning them to a reference genome. NGS is relatively fast and inexpensive, but its reads are errorful: it can randomly misidentify nucleotides, and these errors arise from a complex process that depends on various factors.<br />
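<br />
As a toy illustration of the reading model just described, a read can be decomposed into its overlapping k-mers (this helper is purely illustrative; production aligners operate on full reads):<br />

```python
def kmers(seq, k):
    """Return every overlapping substring of length k in a read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# A toy 10 bp read decomposed into 6-mers.
read = "ACGTACGTAC"
print(kmers(read, 6))
```

Aligners and assemblers index such k-mers to place reads against a reference genome quickly.<br />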
<br />
The process of variant calling is determining novel alleles from sequencing data (typically next-generation sequencing data). Some significant alleles differ from the "standard" version of a gene by only a single base pair, such as disease-associated point mutations. It is therefore crucial to accurately call single-nucleotide swaps/polymorphisms (SNPs) and small insertions and deletions (indels). Calling SNPs and small indels is technically challenging, since it requires a program to distinguish genuinely novel mutations from errors in the sequencing data.<br />
<br />
Previous approaches usually involved various statistical techniques. A widely used one is GATK, which combines logistic regression, hidden Markov models, naive Bayes classification, and Gaussian mixture models to perform the process [2]. However, these methods have their weaknesses, as some of their assumptions do not hold (e.g., independence assumptions), and they are hard to generalize to other sequencing technologies.<br />
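<br />
To make the statistical flavour of such callers concrete, here is a minimal sketch (not GATK's actual model) of genotype likelihoods under a simple binomial read-sampling model with a fixed sequencing error rate:<br />

```python
from math import comb

def genotype_likelihoods(n_reads, n_alt, err=0.01):
    """Binomial likelihood of the alt-read count under each diploid genotype.

    Under hom-ref an alt read is a sequencing error (prob ~ err); under het
    about half the reads carry the alt allele; under hom-alt almost all do.
    This is a textbook sketch, not the model GATK actually uses.
    """
    alt_prob = {"hom-ref": err, "het": 0.5, "hom-alt": 1.0 - err}
    n, b = n_reads, n_alt
    return {g: comb(n, b) * p**b * (1 - p)**(n - b)
            for g, p in alt_prob.items()}

# 30 overlapping reads, 14 supporting the alternate allele.
lik = genotype_likelihoods(30, 14)
best = max(lik, key=lik.get)   # "het" dominates here
```

The weaknesses mentioned above show up directly in this sketch: it assumes reads are independent and the error rate is constant, neither of which holds for real sequencers.<br />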
<br />
This paper aims to solve the problem of calling SNPs and small indels with a convolutional neural network, by casting the reads as images and classifying whether they contain a mutation. It introduces a variant caller called "DeepVariant", which requires no specialized knowledge but performs better than previous state-of-the-art methods.<br />
<br />
== Overview ==<br />
<br />
In Figure 1, the DeepVariant workflow overview is illustrated.<br />
<br />
[[File:figure 111.JPG|Figure 1. In all panels, blue boxes represent data and red boxes are processes]]<br />
<br />
<br />
Initially, the NGS reads aligned to a reference genome are scanned for candidate variant sites, that is, sites where the reads differ from the reference. The read and reference data are encoded as an image for each candidate variant site, and the trained CNN computes the genotype likelihoods (heterozygous or homozygous) for each candidate variant (Figure 1, left box).<br />
To train the CNN for image classification, the DeepVariant machinery makes pileup images for a labeled sample with known genotypes. These labeled images and known genotypes are provided to the CNN for training, and stochastic gradient descent is used to optimize the CNN parameters to maximize genotype prediction accuracy. After the model converges, it is frozen and used to call mutations on new samples (Figure 1, middle box).<br />
For example, in Figure 1 (right box), the reference and read bases are encoded into a pileup image at a candidate variant site. Using this encoded image, the CNN computes the genotype likelihoods for the three diploid genotype states: homozygous reference (hom-ref), heterozygous (het), and homozygous alternate (hom-alt). In this example, a heterozygous variant call is emitted, as the most probable genotype is "het".<br />
<br />
== Preprocessing ==<br />
<br />
Before the sequencing reads can be fed into the classifier, they must be preprocessed. These preprocessing steps represent the real novelty of this technique: they transform the data so that common neural network architectures can be used for classification. The preprocessing can be broken into three phases: realigning the reads, finding candidate variants, and creating images of the candidate variants.<br />
<br />
The realignment phase is essential to ensure the reads can be adequately compared to the reference. First, the reads are aligned to a reference sequence. Reads that align poorly are grouped with other reads around them to build that section of sequence, or haplotype, from scratch. If there is strong evidence that the new version of the haplotype fits the reads well, the reads are realigned to it. This process updates each read's CIGAR (Compact Idiosyncratic Gapped Alignment Report) string, which represents the alignment of a sequence to a reference.<br />
<br />
Once the reads are correctly aligned, the algorithm proceeds to find candidate variants: regions in the DNA sequence that may contain variants. These candidate variants are what will eventually be passed as input to the neural network. To find them, each position in the reference sequence is considered independently. Any unusable reads are filtered at this point: reads that are not properly aligned, are marked as duplicates, fail vendor quality checks, or have a mapping quality of less than ten. For each site in the genome, all the remaining reads that overlap that site are collected. The allele each read aligns to that site is determined by decoding the read's CIGAR string, which was updated in the realignment phase. The alleles are then classified into one of four categories: reference-matching base, reference-mismatching base, insertion with a specific sequence, or deletion with a specific length, and the number of occurrences of each distinct allele across all reads is counted. Read bases are only included as potential alleles if every base in the allele has a quality score of at least 10.<br />
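<br />
The CIGAR decoding mentioned above can be sketched as follows; the function names are ours, and real pipelines use a SAM/BAM library such as pysam rather than hand-rolled parsing:<br />

```python
import re

# The SAM specification encodes an alignment as (length, operation) pairs,
# e.g. "5M1I4M" = 5 aligned bases, a 1-base insertion, 4 more aligned bases.
def parse_cigar(cigar):
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def allele_classes(cigar):
    """Map CIGAR operations to the allele categories described above.

    Distinguishing reference-matching from reference-mismatching bases
    requires the reference itself, so 'M' is left combined here.
    """
    mapping = {"M": "match/mismatch", "=": "match/mismatch",
               "X": "match/mismatch", "I": "insertion", "D": "deletion"}
    return [(mapping[op], length)
            for length, op in parse_cigar(cigar) if op in mapping]

print(parse_cigar("5M1I4M"))
```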
<br />
The last phase of preprocessing is to convert these candidate variants into images representing the data. This allows well-established convolutional neural networks for image classification to be applied to this problem. Each colour channel stores a different piece of information about a candidate variant: the red channel encodes which base is present (A, G, C, or T) by mapping each base to a particular value, and the green channel encodes the quality of the read.<br />
<br />
The blue channel encodes whether the read is on the positive strand of the DNA. Each row of the image represents a read, and each column represents a particular base position in that read. The reference sequence is repeated in the first five rows of the image so that its information survives a 5x5 convolution. With the preprocessing complete, the images can then be passed into the neural network for classification.<br />
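<br />
A minimal sketch of this three-channel encoding is below. The specific channel values are invented for illustration; the paper's pseudocode defines its own (unevenly spaced) mappings:<br />

```python
import numpy as np

# Invented channel values for illustration; the paper's actual mappings differ.
BASE_VAL = {"A": 250, "C": 30, "G": 180, "T": 100}

def encode_pileup(reference, reads, quals, strands, ref_rows=5):
    """Encode a reference plus aligned reads as an RGB pileup image (sketch).

    Rows 0..ref_rows-1: the reference repeated (so it survives a 5x5 conv).
    Remaining rows: one read each.  Channels: R = base identity,
    G = base quality (scaled), B = strand flag.
    """
    width = len(reference)
    img = np.zeros((ref_rows + len(reads), width, 3), dtype=np.uint8)
    for r in range(ref_rows):
        for c, base in enumerate(reference):
            img[r, c] = (BASE_VAL[base], 255, 255)        # max quality, + strand
    for i, (read, qual, strand) in enumerate(zip(reads, quals, strands)):
        for c, base in enumerate(read):
            img[ref_rows + i, c] = (BASE_VAL[base],
                                    min(255, qual * 6),   # scaled quality
                                    255 if strand == "+" else 0)
    return img

img = encode_pileup("ACGT", ["ACGT", "ACTT"], [30, 20], ["+", "-"])
```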
<br />
== Neural Network ==<br />
<br />
The neural network used is a convolutional neural network. Although the full network architecture is not revealed in the paper, several details can be discussed. The network is an input layer attached to an adapted Inception v2 ImageNet model with nine partitions. An interesting aspect of the Inception model is that rather than tuning hyperparameters to find a single optimal filter configuration, it concatenates filters of several sizes on the same layer, in effect learning the best architecture among these concatenated filters. The input layer takes the images representing the candidate variants and rescales them to 299x299 pixels. The output layer is a three-class softmax layer, fully connected to the previous layer and initialized with Gaussian random weights with a standard deviation of 0.001. The three classes are homozygous reference (meaning the site is not a variant), heterozygous variant, and homozygous variant; each candidate variant is assigned to the class with the highest probability.<br />
<br />
The model is trained using stochastic gradient descent with a weight decay of 0.00004, in mini-batches of 32 images, with a root mean squared (RMS) decay of 0.9. For the multiple-sequencing-technologies experiments, a single model was trained with a learning rate of 0.0015 and momentum 0.8 for 250,000 update steps. For all other experiments, multiple models were trained, one for each combination of learning rate (0.00095, 0.001, 0.0015) and momentum (0.8, 0.85, 0.9), and the one with the highest accuracy on the training set was chosen as the final model. These models were trained for 80 hours, or until the training accuracy converged.<br />
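<br />
The final classification step, mapping the network's three output logits to a genotype call, can be sketched as follows (the logits here are hypothetical, not produced by the actual Inception model):<br />

```python
import numpy as np

GENOTYPES = ["hom-ref", "het", "hom-alt"]

def softmax(logits):
    """Numerically stable softmax over the three genotype logits."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())            # shifting by the max avoids overflow
    return e / e.sum()

def call_genotype(logits):
    """Emit the most probable of the three diploid genotype classes."""
    probs = softmax(logits)
    return GENOTYPES[int(np.argmax(probs))], probs

# Hypothetical logits from the final fully connected layer.
label, probs = call_genotype([0.2, 3.1, 0.4])
```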
<br />
== Results ==<br />
<br />
DeepVariant was trained using data available from the CEPH (Centre d'Etude du Polymorphisme Humain) female sample NA12878 and was evaluated on the unseen Ashkenazi male sample NA24385. The results were compared with the most commonly used bioinformatics methods: GATK, FreeBayes, SAMtools, 16GT, and Strelka (Table 1). For comparison, the overall accuracy (F1), recall, precision, and the numbers of true positives (TP), false negatives (FN), and false positives (FP) are reported over the whole genome.<br />
<br />
[[File:table 11.JPG]]<br />
<br />
DeepVariant showed the highest accuracy and more than 50% fewer errors per genome compared to the next best algorithm. <br />
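<br />
The metrics reported in Table 1 are related in the standard way; the counts below are illustrative only, not the paper's actual numbers:<br />

```python
def f1_metrics(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)   # fraction of called variants that are real
    recall = tp / (tp + fn)      # fraction of real variants that are called
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only, not numbers from Table 1.
p, r, f1 = f1_metrics(tp=3500, fp=50, fn=100)
```

F1 is the harmonic mean of precision and recall, so a caller must keep both false positives and false negatives low to score well.<br />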
<br />
They also evaluated the same set of algorithms on the synthetic diploid sample CHM1-CHM13 (Table 2).<br />
<br />
[[File:Table 333.JPG]]<br />
<br />
Results illustrated that the DeepVariant method outperformed all other algorithms for variant calling (SNP and indel) and showed the highest accuracy in terms of F1, recall, precision, and TP.<br />
<br />
== Conclusion ==<br />
<br />
This endeavour to further advance a data-centric approach to understanding the gene sequence illustrates the advantages of deep learning over manual analysis. With billions of DNA base pairs, no human can digest that amount of genetic information. In the past, computational techniques were unfeasible due to the lack of computing power, but in the 21st century, it seems that machine learning is the way forward for molecular biology.<br />
<br />
DeepVariant’s strong performance on human data proves that deep learning is a promising technique for variant calling. Perhaps the most exciting feature of DeepVariant is its simplicity. Unlike other state-of-the-art variant callers, DeepVariant has no knowledge of the sequencing technologies that create the reads, or even of the biological processes that introduce mutations. This simplifies the problem of variant calling to preprocessing the reads and training a generic deep learning model. It also suggests that DeepVariant could be significantly improved by tailoring the preprocessing to specific sequencing technologies and/or developing a dedicated CNN architecture for the reads, rather than casting them as images.<br />
<br />
== Critique and Discussion==<br />
<br />
The paper presents an interesting method for solving an important problem. Building "images" of reads and running them through a generic image classification CNN seems like a strange approach, and it is interesting that it works well. The biggest issue with the paper is the lack of specific information about the methods. Some extra information is included in the supplementary material, but there are still some big gaps. In particular:<br />
<br />
1. What is the structure of the neural net? How many layers, and what sizes? The paper for ConvNet which is cited does not have this information. We suspect that this might be a trade secret that Google is protecting.<br />
<br />
2. How is the realignment step implemented? The paper mentions that it uses a "De-Bruijn-graph-based read assembly procedure" to realign reads to a new haplotype. This is a non-standard step in most genomics workflows yet the paper does not describe how they do the realignment or how they build the haplotypes.<br />
<br />
3. How did they settle on the image construction algorithm? The authors provide pseudocode for the construction of pileup images, but they do not describe how the decisions were made. For instance, the colour values for different base pairs are not evenly spaced. Also, the image begins with 5 rows of the reference genome.<br />
<br />
One thing we appreciated about the paper was their commentary on future developments. The authors make it very clear that this approach can be improved on and provide specific ideas for next steps.<br />
<br />
Overall, the paper presents an interesting idea with strong results, but lacks detail in some key pieces of the implementation.<br />
<br />
The topic of this project is good, but more details of the algorithm are needed. In the neural network part, the details are not sufficient; the authors should provide a figure to better explain how the model works and what its structure is. Otherwise it is hard to understand how the model works. Also, during preprocessing, if different reads have different lengths, should information be added or dropped so that they match?<br />
<br />
5. In particular, which package did the researchers use to perform this analysis? Different deep learning packages can have different accuracy and efficiency when making predictions on this data set.<br />
<br />
Further studies on DeepVariant [https://www.nature.com/articles/s41598-018-36177-7 have shown] that it is a framework with great potential and sets the standard in the medical genetics field.<br />
<br />
6. It was mentioned that part of the network used is an "adapted Inception v2 ImageNet model" - does this mean that it used an Inception v2 model that was pretrained on ImageNet? This is not clear, but if this is the case then why is this useful? Why would the features that are extracted for an image be useful for genomics? Did they try using a model that wasn't pretrained? Also, they describe the preprocessing method used but were there any alternatives that they considered?<br />
<br />
7. A more extensive discussion of the "Neural Network" section could be given. For example, the last sentence of the paragraph says that "Models are trained for 80 hours, or until the training accuracy converged." This implies that when the training accuracy does converge, it usually takes less than 80 hours. More data could thus be presented about the convergence time: How often do the models converge? How long does convergence take on average? Moreover, why was the number 80 chosen? Is it an arbitrary upper bound on training runtime, or was it carefully chosen as some multiple of the average convergence time? These would be interesting facts to look at.<br />
<br />
== References ==<br />
[1] Hartwell, L.H. ''et al.'' ''Genetics: From Genes to Genomes''. (McGraw-Hill Ryerson, 2014).<br />
<br />
[2] Poplin, R. ''et al.'' A universal SNP and small-indel variant caller using deep neural networks. ''Nature Biotechnology'' '''36''', 983-987 (2018).</div>
<hr />
<div>== Background ==<br />
<br />
Genes determine Biological functions, and mutants or alleles(one of two or more alternative forms of a gene that arise by mutation and are found at the same place on a chromosome) of those genes determine differences within a function. Determining novel alleles is very important in understanding the genetic variation within a species. For example, different alleles of the gene OCA2 determine eyes' color. All animals receive one copy of each gene from each of their parents. Mutations of a gene are classified as either homozygous (both copies are the same) or heterozygous (the two copies are different).<br />
<br />
Next-generation sequencing is a prevalent technique for sequencing or reading DNA. Since all genes are encoded as DNA, sequencing is an essential tool for understanding genes. Next-generation sequencing works by reading short sections of DNA of length k, called k-means, and then piecing them together or aligning them to a reference genome. Next-generation sequencing is relatively fast and inexpensive, although it can randomly misidentify some nucleotides, introducing errors. However, NGS reading is errorful and arises from a complex error process depending on various factors.<br />
<br />
The process of variant calling is determining novel alleles from sequencing data (typically next-generation sequencing data). Some significant alleles only differ from the "standard" version of a gene by only a single base pair, such as the mutation which causes multiple sclerosis. Therefore it is crucial to accurately call single nucleotide swaps/polymorphisms (SNPs), insertions, and deletions (indels). Calling SNPs and small indels are technically challenging since it requires a program to distinguish between genuinely novel mutations and errors in the sequencing data.<br />
<br />
Previous approaches usually involved using various statistical techniques. A widely used one is GATK. GATK uses a combination of logistic regression, hidden Markov models, naive Bayes classification, and Gaussian mixture models to perform the process [2]. However, these methods have their weaknesses as some assumptions do not hold (i.e., independence assumptions), and it's hard to generalize them to other sequencing technologies.<br />
<br />
This paper aims to solve the problem of calling SNPs and small indels using a convolutional neural net by casting the reads as images and classifying whether they contain a mutation. It introduces a variant caller called "DeepVariant", which requires no specialized knowledge, but performs better than previous state-of-art methods.<br />
<br />
== Overview ==<br />
<br />
In Figure 1, the DeepVariant workflow overview is illustrated.<br />
<br />
[[File:figure 111.JPG|Figure 1. In all panels, blue boxes represent data and red boxes are processes]]<br />
<br />
<br />
Initially, the NGS reads aligned to a reference genome are scanned for candidate variants which are different sites from the reference genome. The read and reference data are encoded as an image for each candidate variant site. Then, trained CNN can compute the genotype likelihoods, (heterozygous or homozygous) for each of the candidate variants (figure1, left box). <br />
To train the CNN for image classification purposes, the DeepVariant machinery makes pileup images for a labeled sample with known genotypes. These labeled images and known genotypes are provided to CNN for training, and a stochastic gradient descent algorithm is used to optimize the CNN parameters to maximize genotype prediction accuracy. After the convergence of the model, the final model is frozen to use for calling mutations for other image classification tests (figure1, middle box).<br />
For example, in figure 1 (right box), the reference and read bases are encoded into a pileup image at a candidate variant site. CNN using this encoded image computes the genotype likelihoods for the three diploid genotype states of homozygous reference (hom-ref), heterozygous (het) or homozygous alternate (hom-alt). In this example, a heterozygous variant call is emitted, as the most probable genotype here is “het”.<br />
<br />
== Preprocessing ==<br />
<br />
Before the sequencing reads can be fed into the classifier, they must be preprocessed. There are many pre-processing steps that are necessary for this algorithm. These steps represent the real novelty in this technique, by transforming the data in a way that allows us to use more common neural network architectures for classification. The preprocessing of the data can be broken into three main phases: the realignment of reads, finding candidate variants and creating images of the candidate variants. <br />
<br />
The realignment of the reads phase of the preprocessing is important in order to ensure the sequences can be properly compared to the reference sequences. First, the sequences are aligned to a reference sequence. Reads that align poorly are grouped with other reads around them to build that section, or haplotype, from scratch. If there is strong evidence that the new version of the haplotype fits the reads well, the reads are re-aligned to it. This process updates the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string, a way to represent the alignment of a sequence to a reference, for each read.<br />
<br />
Once the reads are properly aligned, the algorithm then proceeds to find candidate variants, regions in the DNA sequence that may contain variants. It is these candidate variants that will eventually be passed as input to the neural network. To find these, we need to consider each position in the reference sequence independently. Any unusable reads are filtered at this point. This includes reads that are not aligned properly, ones that are marked as duplicates, those that fail vendor quality checks, or whose mapping quality is less than ten. For each site in the genome, we collect all the remaining reads that overlap that site. The corresponding allele aligned to that site is then determined by decoding the CIGAR string, which was updated in the realignment phase, of each read. The alleles are then classified into one of four categories: reference-matching base, reference-mismatching base, insertion with a specific sequence, or deletion with a specific length, and the number of occurrences of each distinct allele across all reads is counted. Read bases are only included as potential alleles if each base in the allele has a quality score of at least 10.<br />
<br />
With candidate variants identified, the last phase of pre-processing is to convert these candidate variants into images representing the data. This allows for the use of well established convolutional neural networks for image classification for this specialized problem. Each colour channel is used to store a different piece of information about a candidate variant. The red channel encodes which base we have (A, G, C, or T), by mapping each base to a particular value. The quality of the read is mapped to the green colour channel. And finally, the blue channel encodes whether or not the reference is on the positive strand of the DNA. Each row of the image represents a read, and each column represents a particular base in that read. The reference strand is repeated for the first five rows of the encoded image, in order to maintain its information after a 5x5 convolution is applied.<br />
With the data preprocessing complete, the images can then be passed into the neural network for classification.<br />
<br />
== Neural Network ==<br />
<br />
The neural network used is a convolutional neural network. Although the full network architecture is not revealed in the paper, there are several details which we can discuss. The architecture of the network is an input layer attached to an adapted Inception v2 ImageNet model with nine partitions. The inception v2 model in particular uses a series of CNNs. One interesting aspect about the Inception model is that rather than optimizing a series of hyperparameters in order to determine the most optimal parameter configuration, Inception instead concatenates a series of different sizes of filters on the same layer, which acts to learn the best architecture out of these concatenated filters. The input layer takes as input the images representing the candidate variants and rescales them to 299x299 pixels. The output layer is a three-class Softmax layer initialized with Gaussian random weights with a standard deviation of 0.001. This final layer is fully connected to the previous layer. The three classes are the homozygous reference (meaning it is not a variant), heterozygous variant, and homozygous variant. The candidate variant is classified into the class with the highest probability. The model is trained using stochastic gradient descent with a weight decay of 0.00004. The training was done in mini-batches, each with 32 images, using a root mean squared (RMS) decay of 0.9. For the multiple sequencing technologies experiments, a single model was trained with a learning rate of 0.0015 and momentum 0.8 for 250,000 update steps. For all other experiments, multiple models were trained, and the one with the highest accuracy on the training set was chosen as the final model. The multiple models stem from using each combination of the possible parameter values for the learning rate (0.00095, 0.001, 0.0015) and momentum (0.8, 0.85, 0.9). These models were trained for 80 hours, or until the training accuracy converged.<br />
<br />
== Results ==<br />
<br />
DeepVariant was trained using data available from the CEPH (Centre d’Etude du Polymorphism Humain) female sample NA12878 and was evaluated on the unseen Ashkenazi male sample NA24385. The results were compared with other most commonly used bioinformatics methods, such as the GATK, FreeBayes22, SAMtools23, 16GT24 and Strelka25 (Table 1). For better comparison, the overall accuracy (F1), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are illustrated over the whole genome.<br />
<br />
[[File:table 11.JPG]]<br />
<br />
DeepVariant showed the highest accuracy and more than 50% fewer errors per genome compared to the next best algorithm. <br />
<br />
They also evaluated the same set of algorithms using the synthetic diploid sample CHM1-CHM1326 (Table 2).<br />
<br />
[[File:Table 333.JPG]]<br />
<br />
Results illustrated that the DeepVariant method outperformed all other algorithms for variant calling (SNP and indel) and showed the highest accuracy in terms of F1, Recall, precision and TP.<br />
<br />
== Conclusion ==<br />
<br />
This endeavour to further advance a data-centric approach to understanding the gene sequence illustrate the advantages of deep learning over humans. With billions of DNA base pairs, no humans are able to digest that amount of gene expressions. In the past, computational techniques are unfeasible due to the lack of compute power but in the 21st century, it seems that machine learning is the way to go for molecular biology.<br />
<br />
DeepVariant’s strong performance on human data proves that deep learning is a promising technique for variant calling. Perhaps the most exciting feature of DeepVariant is its simplicity. Unlike other states of the art variant callers, DeepVariant has no knowledge of the sequencing technologies that create the reads, or even the biological processes that introduce mutations. This simplifies the problem of variant calling to preprocessing the reads and training a generic deep learning model. It also suggests that DeepVariant could be significantly improved by tailoring the preprocessing to specific sequencing technologies and/or developing a dedicated CNN architecture for the reads, rather than trying to cast them as images.<br />
<br />
== Critique and Discussion==<br />
<br />
The paper presents an interesting method for solving an important problem. Building "images" of reads and running them through a generic image classification CNN seems like a strange approach, and it is interesting that it works well. The biggest issues with the paper are the lack of specific information about how the methods. Some extra information is included in the supplementary material, but there are still some big gaps. In particular:<br />
<br />
1. What is the structure of the neural net? How many layers, and what sizes? The cited ConvNet paper does not contain this information. We suspect that this might be a trade secret that Google is protecting.<br />
<br />
2. How is the realignment step implemented? The paper mentions that it uses a "De-Bruijn-graph-based read assembly procedure" to realign reads to a new haplotype. This is a non-standard step in most genomics workflows yet the paper does not describe how they do the realignment or how they build the haplotypes.<br />
<br />
3. How did they settle on the image construction algorithm? The authors provide pseudocode for the construction of pileup images, but they do not describe how the design decisions were made. For instance, the colour values for different base pairs are not evenly spaced. Also, it is unclear why the image begins with 5 rows of the reference genome.<br />
<br />
One thing we appreciated about the paper was their commentary on future developments. The authors make it very clear that this approach can be improved on and provide specific ideas for next steps.<br />
<br />
Overall, the paper presents an interesting idea with strong results, but lacks detail in some key pieces of the implementation.<br />
<br />
The topic of this project is good, but more details of the algorithm are needed. In the neural network part, the details are insufficient; the authors should provide a figure to better explain how the model works and what its structure is, since otherwise it is hard to understand the model. Also, when preprocessing the data, if different samples have different lengths, should we add more information or drop some so that they match?<br />
<br />
5. Particularly, which package did the researchers use to perform this analysis? Different deep learning packages can have different accuracy and efficiency when making predictions on this data set. <br />
<br />
Further studies on DeepVariant [https://www.nature.com/articles/s41598-018-36177-7 have shown] that it is a framework with great potential and sets the standard in the medical genetics field.<br />
<br />
6. It was mentioned that part of the network used is an "adapted Inception v2 ImageNet model" - does this mean that it used an Inception v2 model that was pretrained on ImageNet? This is not clear, but if this is the case then why is this useful? Why would the features that are extracted for an image be useful for genomics? Did they try using a model that wasn't pretrained? Also, they describe the preprocessing method used but were there any alternatives that they considered?<br />
<br />
7. A more extensive discussion of the "Neural Network" section could be given. For example, the last sentence of the paragraph says that "Models are trained for 80 hours, or until the training accuracy converged." This sentence implies that if the training accuracy does converge, it usually takes less than 80 hours. More interesting data could thus be presented about the convergence time of the training accuracy. How often do the models converge? How long does convergence take on average? Moreover, why was the number 80 chosen? Is it an arbitrary upper bound on the runtime of model training, or was 80 chosen carefully so that it is twice/three times/ten times the average convergence time? These would be interesting facts to look at.<br />
<br />
== References ==<br />
[1] Hartwell, L.H. ''et al.'' ''Genetics: From Genes to Genomes''. (McGraw-Hill Ryerson, 2014).<br />
<br />
[2] Poplin, R. ''et al.'' A universal SNP and small-indel variant caller using deep neural networks. ''Nature Biotechnology'' '''36''', 983-987 (2018).</div>

http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=49100 Task Understanding from Confusing Multi-task Data (2020-12-04, Z42qin: /* Introduction */)
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is artificial intelligence that outperforms humans in a narrowly defined task. Applications of Narrow AI are becoming increasingly common: spam filtering, music recommendation services, assisting doctors with data-driven decisions, and even self-driving cars. One of the most famous integrated forms of Narrow AI is Apple's Siri. Siri has no self-awareness or genuine intelligence, and hence often struggles with tasks outside its range of abilities. However, the widespread use of Narrow AI in important infrastructure raises concerns: some argue that its characteristics make it fragile, so that when neural networks control critical systems (such as power grids or financial transactions), risk-averse operators may prefer alternatives. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, closely resembling human intelligence. This paper attempts to generalize the multi-task learning system, which learns from data drawn from multiple classification tasks. A very difficult task learned in isolation may not be learned well; for instance, a network taking 1000*1000-pixel inputs is unlikely to identify complicated objects in real-world scenes in reasonable time on its own. If related tasks are learned simultaneously, however, they can share what they learn (shapes, landmarks, textures, orientation, and so on), making each task easier than it would be in isolation. This is called multi-task learning. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The de-confusing function allocates samples to respective tasks and the mapping function presents the relation from the input to its label within the allocated tasks. See figure 1(b). To implement the CSL, we use a risk functional to balance the effects of the de-confusing function and mapping function. <br />
<br />
However, simply combining the two functions or networks into a single architecture is impossible, since the one-hot constraint on the outputs of the de-confusing network makes gradient back-propagation infeasible. This difficulty is solved by alternately training the de-confusing net and the mapping net in the proposed architecture CSL-Net.<br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine assigned with complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, we know that samples are generated from multiple distinct distributions instead of one distribution combining a mixture of multiple probability models. Thus, the latent variable learning can not fully distinguish labels into different tasks and different distributions, and it is insufficient to classify the multi-task confusing samples. <br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning of one task can improve the learning of another (Caruana, 1997). This improves overall learning efficiency, since the labels in different tasks are often correlated: improving the classification result for one class also helps with other classification tasks. In multi-task learning, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn; if such manual task annotation is absent, the algorithm cannot be performed. <br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, let <math> (x,y)</math> be the training samples from <math>y=f(x)</math>, which is an identical but unknown mapping relationship. Assuming the risk measure is mean squared error (MSE), the expected risk function is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the data distribution of the input variable <math>x</math>. In practice, the methods select the optimal function by minimizing the empirical risk:<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
To minimize the risk function, the theoretically optimal solution is <math> f(x) </math>.<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk function can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math>. However, the optimal solution is not conditional on the specific task at hand, but rather depends on the entire set of ground-truth functions. The solution represents a mixed probability model rather than the exact tasks and their corresponding individual probability distributions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math>, which implies that there is an unavoidable confusion risk.<br />
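This pull toward the mixture mean can be checked numerically. The sketch below is illustrative and not from the paper (the choice of functions and evaluation grid is ours): it compares the risk of the mixture mean <math>\bar{f}</math> with that of one ground-truth function on confusing data from two equally likely tasks.<br />

```python
# Two equally likely ground-truth tasks: f1(x) = x and f2(x) = -x.
# Under MSE, the single best mapping is the mixture mean
# f_bar(x) = 0.5*f1(x) + 0.5*f2(x) = 0, which fits neither task.

def risk(g, xs, tasks):
    """Approximate expected squared-error risk of g on confusing data."""
    total = 0.0
    for x in xs:
        for f, p in tasks:
            total += p * (f(x) - g(x)) ** 2
    return total / len(xs)

f1 = lambda x: x
f2 = lambda x: -x
tasks = [(f1, 0.5), (f2, 0.5)]
xs = [i / 10 for i in range(-10, 11)]  # evaluation grid on [-1, 1]

f_bar = lambda x: 0.5 * f1(x) + 0.5 * f2(x)  # the mixture mean
print(risk(f_bar, xs, tasks) < risk(f1, xs, tasks))  # True
print(risk(f_bar, xs, tasks) > 0)                    # True
```

The mean attains the lowest risk of any single mapping, yet its risk stays strictly positive, which is exactly the unavoidable confusion risk noted above.<br />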
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to the output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (using MSE loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
The risk metric of every sample affects only its assigned task.<br />
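Read literally, this empirical risk charges each sample's error only to the task its indicator selects. A small sketch (the helper names are ours, not the paper's):<br />

```python
def empirical_risk(samples, mappings, h):
    """R_e(g, h): a sample's squared error counts only toward the task k
    for which the deconfusing indicator h(x, y, k) is 1."""
    total = 0.0
    for x, y in samples:
        for k, g_k in enumerate(mappings):
            total += (y - g_k(x)) ** 2 * h(x, y, k)
    return total

# Toy check: two mappings and an indicator assigning each sample
# to whichever mapping is closest to its label.
g = [lambda x: x, lambda x: -x]
h = lambda x, y, k: 1.0 if k == min(range(len(g)),
                                    key=lambda j: (y - g[j](x)) ** 2) else 0.0
samples = [(1.0, 1.0), (1.0, -1.0)]  # one sample from each task
print(empirical_risk(samples, g, h))  # 0.0 under a perfect assignment
```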
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
However, necessity constraints are needed to avoid meaningless trivial solutions in all optimal risk solutions.<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math>, for a CSL learning framework with finite VC dimension <math>\tau</math>, the risk measure is bounded by''<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
This theorem shows the method of empirical risk minimization is valid in the CSL framework. Moreover, the assumed number of tasks affects the VC dimension of the learning functions, which is positively related to the generalization error. Therefore, to make the training risk small, we need to choose the ''minimum number'' of tasks when determining the task.<br />
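As a quick numeric reading of the bound (the values of <math>\tau</math> and <math>\eta</math> below are illustrative), the capacity term <math>\epsilon(m)</math> shrinks as the training set grows, so the gap between empirical and expected risk tightens:<br />

```python
import math

def epsilon(m, tau=50, eta=0.05):
    """Capacity term from Theorem 2:
    epsilon(m) = 4 * (tau*(ln(2m/tau) + 1) - ln(eta/4)) / m."""
    return 4 * (tau * (math.log(2 * m / tau) + 1) - math.log(eta / 4)) / m

print(epsilon(1_000) > epsilon(100_000))  # True: more data tightens the bound
```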
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL, including the structure of CSL-Net and the iterative deconfusing algorithm.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net corresponds to the function set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one particular task. The deconfusing-net corresponds to the function h, whose input is a sample <math>(x,y)</math> and whose output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the one-hot constraint on the deconfusing-net's output. A softmax approximation would make the deconfusing-net's output non-one-hot, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
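The alternating scheme can be sketched with least-squares lines standing in for the two networks. This is an illustration of the idea only, not the paper's implementation; the slopes <math>a_k</math> below play the role of network parameters.<br />

```python
# Alternating "deconfusing" (assignment) and "mapping" (refit) steps
# on confusing 1D data drawn from two tasks: y = 2x and y = -x.

def fit_slope(pts):
    """Least-squares slope for y = a*x through the origin."""
    num = sum(x * y for x, y in pts)
    den = sum(x * x for x, y in pts) or 1.0
    return num / den

data = [(x / 10, 2 * x / 10) for x in range(1, 11)] + \
       [(x / 10, -x / 10) for x in range(1, 11)]

a = [1.0, -2.0]  # initial mapping parameters
for _ in range(10):
    # Deconfusing step: assign each sample to its closest mapping.
    assign = [min(range(2), key=lambda k: (y - a[k] * x) ** 2)
              for x, y in data]
    # Mapping step: refit each mapping on its assigned samples.
    for k in range(2):
        pts = [(x, y) for (x, y), c in zip(data, assign) if c == k]
        if pts:
            a[k] = fit_slope(pts)

print(sorted(a))  # ~ [-1.0, 2.0]: the two ground-truth slopes recovered
```

Just as in CSL-Net, each half-step holds one set of parameters fixed while optimizing the other, and the loop repeats until the assignments stop changing.<br />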
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to reproduce both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of digit data in a range of 0 to 9, each of which is in a single color among the eight different colors. Each observation in this modified set consists of a colored image (<math>x_i</math>) and a label (<math>y_i</math>) that represents either the corresponding color, or the digit. The goal is to reproduce the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': The second image classification data set consists of several fashion-related objects labeled from any of the 3 criteria: “gender”, “category”, and “main color”, whose number of observations is larger than that of the "colored-MNIST" data set.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks. The CSL methods autonomously learned three tasks which corresponded exactly to “Gender”,<br />
“Category”, and “Color” as we see it.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
The purpose of this measure arises from the fact that, in addition to learning mapping allocations like humans, machines should be able to approximate all mapping functions accurately in order to provide corresponding labels. The Label Prediction Accuracy measure captures the exchange equivalence of the following task: each mapping contains its ground-truth output, and machines should be predicting the correct output that is close to the ground-truth. <br />
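To make the role of the max over <math>k</math> concrete, here is an illustrative computation (the names are ours): the learned assignment recovers the right partition but with the task ids permuted, and <math>\alpha_T</math> still reports perfect agreement.<br />

```python
def task_accuracy(h_learned, h_ref, samples, j, n_tasks):
    """alpha_T(j): agreement between the learned assignment and a reference
    assignment on membership in task j, maximized over the learned task id k."""
    best = 0.0
    for k in range(n_tasks):
        agree = sum(1 for x, y in samples
                    if (h_learned(x, y) == k) == (h_ref(x, y) == j))
        best = max(best, agree / len(samples))
    return best

# Toy data: the learned model found the same split but with task ids swapped.
samples = [(0, 0), (1, 1), (2, 0), (3, 1)]
h_ref = lambda x, y: y          # ground truth: task = y
h_learned = lambda x, y: 1 - y  # same partition, labels permuted
print(task_accuracy(h_learned, h_ref, samples, j=1, n_tasks=2))  # 1.0
```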
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': To "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods learn the ground-truth well, indicating that the initialization of the neural network was set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labelling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.<br />
<br />
Applications of multi-label classification include building a recommendation system, social media targeting, as well as detecting adverse drug reactions from the text.<br />
<br />
Multi-label can be used to improve the syndrome diagnosis of a patient by focusing on multiple syndromes instead of a single syndrome.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task number progressively and testing the performance. Ideally, a more principled way of deciding the number of tasks is desirable, rather than increasing it one by one and taking the minimum number that gives the smallest risk. Adding low-quality constraints to the deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features; so far, a pre-trained CNN backbone is needed for complicated image classification problems. Although this does not affect the effectiveness of the proposed algorithm at learning confusing data from pre-trained features, the fully-connected network can only be trained on learned CNN features. It remains a challenge for the current algorithm to learn basic features directly through a CNN structure while understanding tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem without manual task annotations from basic input data. The model obtains a basic task concept by learning the minimum risk for confusing samples from differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, some limitations can be improved for future work:<br />
<br />
- The repeated training process required to find the smallest number of tasks whose risk is closest to zero makes the learning process inefficient; <br />
<br />
- It is difficult for the current algorithm to learn basic features directly through a CNN structure while understanding tasks simultaneously by training a fully-connected network. However, this limitation does not affect the effectiveness of the algorithm in learning confusing data based on pre-trained features.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was compared against algorithms that were not designed to deal with confusing data and that do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
Compared to the standard supervised learning, Multi-label learning can associate a training sample with multiple category tags at the same time. It can assign multiple labels to some hidden instances and can be reduced to standard supervised learning by limiting the number of class labels per instance. <br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed to converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented on more examples, to test its robustness on different types of data. The validation tasks chosen are all very simple, and CSL is not actually necessary for them. For the colored MNIST data, a simple function can be written to distinguish the color label from the number label. The same problem applies to the Kaggle Fashion Product dataset: the candidate labels can easily be classified into different tasks by some wording analysis or meaning-classification program, or even manually. Even though the idea discussed by the authors is interesting, their examples suggest very limited or even unnecessary applications. In most cases, it is more beneficial to treat confusing multi-task data problems in two distinct stages: classify the tasks first according to the meaning of the labels, and then perform a multi-class/multi-label training process.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
When using this framework for classification, the order of the one-hot classification labels for each task will likely influence the relationships learned between each task, since the same output header is used for all tasks. This may be why this method fails to learn low-level representations and requires pretraining. I would like to see more explanation in the paper about why this isn't a problem if it was investigated.<br />
<br />
It would be a good idea to include comparison details in the summary to make the results and the conclusion more convincing. For instance, though the paper introduced the result generated using confusion data, and provide some applications for multi-label learning, these two sections still fell short and could use some technical details as supporting evidence.<br />
<br />
It is interesting to investigate if the order of adding tasks will influence the model performance.<br />
<br />
It would be interesting to see the effectiveness of applying CSL in face recognition, such that not only does the algorithm map the face to identity, it also categorizes the face based on other features like beard/no beard and glasses/no glasses simultaneously.<br />
<br />
It would be better for the researchers to compare the efficiency of this approach with other models.<br />
<br />
For pattern recognition, pre-trained features were used in the algorithm. It would be interesting to see how the effectiveness of the model changes if it is trained on data directly from a CNN structure in the future.<br />
<br />
So basically, given a confusing dataset, CSL finds the important tasks or labels, as can be seen from the fruit example: fruits are grouped by name, taste, and color when CSL is given a mixed dataset. Hence, given unstructured, unlabeled, confusing data, CSL helps find the labels, which in turn can help in cleaning the dataset and preparing a high-quality training set, which is very important for many ML algorithms. Since preparing such datasets currently requires manual annotation, CSL can save time in that process.<br />
<br />
For the Colorful-Mnist data set, the goal is to understand the concept of multiple classification tasks from these examples. All inputs have multiple classification tasks. Each observed sample only represents the classification result of one task, and the task from which the sample comes is unknown.<br />
<br />
It would be nice to know why the given metrics of confusing supervised learning are used. The authors should have used several different metrics and shown that CSL overall performs better than other methods. It is also unclear which algorithms "the other methods" refers to.<br />
<br />
In the Iterative Deconfusing algorithm section, the Training of Mapping-Net needs more explanation. The authors should specify what it is doing before showing its equations.<br />
<br />
For the results section, it would be more intuitive and convincing if the authors provided more detail on these two methods and added a plot to support the claim. Based on the text alone, the comparison is not obvious.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.<br />
<br />
[6] Guo-Ping Liu, Jian-Jun Yan, Yi-Qin Wang, Jing-Jing Fu, Zhao-Xia Xu, Rui Guo, Peng Qian, "Application of Multilabel Learning Using the Relevant Feature for Each Label in Chronic Gastritis Syndrome Diagnosis", Evidence-Based Complementary and Alternative Medicine, vol. 2012, Article ID 135387, 9 pages, 2012. https://doi.org/10.1155/2012/135387</div>
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN&diff=49099 Neural Speed Reading via Skim-RNN (2020-12-04T05:45:52Z) <p>Z42qin: /* Conclusion */</p>
<hr />
<div>== Group ==<br />
<br />
Mingyan Dai, Jerry Huang, Daniel Jiang<br />
<br />
== Introduction ==<br />
<br />
A Recurrent Neural Network (RNN) is a class of artificial neural networks in which the connections between nodes form a directed graph along a temporal sequence, giving the network temporal dynamic behaviour. RNNs are derived from feedforward neural networks and can use their internal memory to process variable-length input sequences. In RNNs, outputs can be fed back to the network as inputs, creating a recurrent structure that handles varying lengths of data. This makes them suitable for tasks such as unsegmented, connected handwriting recognition and speech recognition.<br />
<br />
In Natural Language Processing, the RNN is a common architecture used to sequentially ‘read’ input tokens and output a distributed representation for each token. Because it recurrently updates its hidden state, an RNN inherently incurs the same computational cost at every time step. However, some input tokens are comparatively less important to the overall representation of a piece of text or query. For example, when applying an RNN to the question answering problem, it is not uncommon to encounter parts of a passage that are irrelevant to answering the query.<br />
<br />
LSTM-Jump (Yu et al., 2017), a variant of the LSTM, was introduced to improve efficiency by skipping multiple tokens at a given step, with the decision of how far to jump based on the RNN's current hidden state. In contrast, Skim-RNN takes advantage of 'skimming' rather than 'skipping' tokens: it predicts whether each word is important or unimportant, using a large RNN if the word is important and a small RNN otherwise. This paper demonstrates that skimming achieves higher accuracy than skipping tokens, implying that paying some attention to unimportant tokens is better than ignoring them completely.<br />
<br />
== Model ==<br />
<br />
In this paper, the authors introduce a model called 'Skim-RNN', which takes advantage of ‘skimming’ less important tokens or pieces of text rather than ‘skipping’ them entirely. This models the human ability to skim through passages, spending less time on parts that do not affect the reader’s main objective. While skimming reduces the comprehension rate of the text [1], it greatly reduces reading time by not focusing on parts of the text that do not significantly affect the reader's objective.<br />
<br />
'Skim-RNN' works by rapidly determining the significance of each input and spending less time processing unimportant input tokens, using a smaller RNN to update only a fraction of the hidden state. When the decision is to ‘fully read’, that is, to not skim the text, Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function (‘skim’ or ‘read’) is non-differentiable, the authors use a Gumbel-softmax [2] to estimate the gradient of the function, rather than traditional methods such as REINFORCE (policy gradient) [3]. The Gumbel-softmax provides a differentiable sample, in contrast to the non-differentiable sample of a categorical distribution [2]. The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of float operations (Flop reduction, or Flop-R). A high skimming rate often leads to faster inference on CPUs, which makes it very useful for large-scale products and small devices.<br />
<br />
The Skim-RNN has the same input and output interfaces as standard RNNs, so it can be conveniently used to speed up RNNs in existing models. In addition, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting a parameter for the threshold for the ‘skim’ decision.<br />
<br />
=== Related Works ===<br />
<br />
As the popularity of neural networks has grown, significant attention has been given to making them faster and lighter. In particular, several related works have focused on reducing the computational cost of recurrent neural networks. For example, LSTM-Jump (Yu et al., 2017) [8] aims to speed up run times by skipping certain input tokens, as opposed to skimming them. Choi et al. (2017) [9] proposed a model that uses a CNN-based sentence classifier to determine the most relevant sentence(s) to the question and then uses an RNN-based question-answering model. That model focuses on reducing GPU run times (as opposed to Skim-RNN, which focuses on minimizing CPU time or Flop) and is also limited to question answering.<br />
<br />
=== Implementation ===<br />
<br />
A Skim-RNN consists of two RNN cells, a default (big) RNN cell of hidden state size <math>d</math> and a small RNN cell of hidden state size <math>d'</math>, where <math>d</math> and <math>d'</math> are parameters defined by the user and <math>d' \ll d</math>. This follows the fact that there should be a small RNN cell defined for when text is meant to be skimmed and a larger one for when the text should be processed as normal.<br />
<br />
Each RNN cell has its own set of weights and biases and can be any RNN variant. There is no requirement on how the RNN itself is structured; rather, the core concept is to allow the model to dynamically decide which cell to use when processing input tokens. Note that skipping text can be incorporated by setting <math>d'</math> to 0, which means that when the input token is deemed irrelevant to a query or classification task, nothing about the information in the token is retained within the model.<br />
<br />
Experimental results suggest that this model is faster than using a single large RNN to process all input tokens, as the smaller RNN requires fewer floating-point operations to process the token. Additionally, higher accuracy and computational efficiency are achieved. <br />
<br />
==== Inference ====<br />
<br />
At each time step <math>t</math>, the Skim-RNN unit takes in an input <math>{\bf x}_t \in \mathbb{R}^d</math> as well as the previous hidden state <math>{\bf h}_{t-1} \in \mathbb{R}^d</math> and outputs the new state <math>{\bf h}_t </math> (although the dimensions of the hidden state and input are the same, this process holds for different sizes as well). In the Skim-RNN, there is a hard decision that needs to be made whether to read or skim the input, although there could be potential to include options for multiple levels of skimming.<br />
<br />
The decision to read or skim is done using a multinomial random variable <math>Q_t</math> over the probability distribution of choices <math>{\bf p}_t</math>, where<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>{\bf p}_t = \text{softmax}(\alpha({\bf x}_t, {\bf h}_{t-1})) = \text{softmax}({\bf W}[{\bf x}_t; {\bf h}_{t-1}]+{\bf b}) \in \mathbb{R}^k</math><br />
</div><br />
<br />
where <math>{\bf W} \in \mathbb{R}^{k \times 2d}</math>, <math>{\bf b} \in \mathbb{R}^{k}</math> are weights to be learned and <math>[{\bf x}_t; {\bf h}_{t-1}] \in \mathbb{R}^{2d}</math> indicates the row concatenation of the two vectors. In this case, <math> \alpha </math> can have any form as long as the complexity of calculating it is less than <math> O(d^2)</math>. Letting <math>{\bf p}^1_t</math> indicate the probability for fully reading and <math>{\bf p}^2_t</math> indicate the probability for skimming the input at time <math> t</math>, it follows that the decision to read or skim can be modelled using a random variable <math> Q_t</math> by sampling from the distribution <math>{\bf p}_t</math> and<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>Q_t \sim \text{Multinomial}({\bf p}_t)</math><br />
</div><br />
<br />
Without loss of generality, we can define <math> Q_t = 1</math> to indicate that the input will be read while <math> Q_t = 2</math> indicates that it will be skimmed. Reading requires applying the full RNN on the input as well as the previous hidden state to modify the entire hidden state while skimming only modifies part of the prior hidden state.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \begin{cases}<br />
f({\bf x}_t, {\bf h}_{t-1}) & Q_t = 1\\<br />
[f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)] & Q_t = 2<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
where <math> f </math> is a full RNN with output of dimension <math>d</math> and <math>f'</math> is a smaller RNN with <math>d'</math>-dimensional output. This has the advantage that when the model decides to skim, the computational complexity of that step is only <math>O(d'd)</math>, which is much smaller than <math>O(d^2)</math> since <math> d' \ll d</math>.<br />
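The read/skim update above can be sketched in a few lines of NumPy. This is a minimal illustration only: <code>skim_rnn_step</code>, <code>f_full</code>, and <code>f_small</code> are hypothetical names, the two cells are passed in as plain functions, and a hard argmax decision stands in for sampling <math>Q_t</math> at inference time.<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skim_rnn_step(x, h_prev, W, b, f_full, f_small, d_small):
    """One inference step of a Skim-RNN (hypothetical helper names).

    x, h_prev : input and previous hidden state, both shape (d,)
    W, b      : decision parameters, shapes (2, 2d) and (2,)
    f_full    : full RNN cell, returns a d-dimensional state
    f_small   : small RNN cell, returns a d_small-dimensional state
    """
    # Read/skim probabilities from the concatenated input and hidden state.
    p = softmax(W @ np.concatenate([x, h_prev]) + b)
    if p[0] >= p[1]:              # Q_t = 1: read -> update the entire state
        return f_full(x, h_prev)
    h_new = h_prev.copy()         # Q_t = 2: skim -> update only the first d' dims
    h_new[:d_small] = f_small(x, h_prev)
    return h_new
```

Here a skim costs only the small cell's <math>O(d'd)</math> work plus the <math>O(d)</math> decision, matching the complexity noted above.<br />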
<br />
==== Training ====<br />
<br />
Since the expected loss/error of the model is a random variable that depends on the sequence of random variables <math> \{Q_t\} </math>, the loss is minimized with respect to the distribution of these variables, where each <math> Q_t </math> is multinomial with probability vector <math>{\bf p}_t</math>. Defining the loss conditioned on a particular sequence of decisions as<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L(\theta\vert Q)<br />
</math><br />
</div><br />
where <math>Q=Q_1\dots Q_T</math> is a sequence of decisions of length <math>T</math>, the expected loss over the distribution of decision sequences is<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
\mathbb{E}[L(\theta)] = \sum_{Q} L(\theta\vert Q)P(Q) = \sum_Q L(\theta\vert Q) \prod_j {\bf p}_j^{Q_j}<br />
</math><br />
</div><br />
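For a toy sequence this expectation can be enumerated by brute force over all <math>2^T</math> read/skim sequences (an illustration only; the 0/1 encoding of read/skim and the name <code>expected_loss</code> are assumptions, and it is exactly this exponential enumeration that makes direct optimization infeasible for realistic <math>T</math>):<br />

```python
import itertools
import numpy as np

def expected_loss(loss_given_Q, p, T):
    """Brute-force E[L] over all 2^T decision sequences (toy illustration).

    loss_given_Q : function mapping a decision tuple Q to a scalar loss
    p            : array of shape (T, 2), per-step (read, skim) probabilities
    """
    total = 0.0
    for Q in itertools.product((0, 1), repeat=T):  # 0 = read, 1 = skim
        prob = np.prod([p[t, q] for t, q in enumerate(Q)])
        total += loss_given_Q(Q) * prob
    return total
```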
<br />
Since calculating <math>\nabla_\theta \mathbb{E}[L(\theta)]</math> directly is infeasible, the gradients can instead be approximated with a Gumbel-softmax distribution [2]. Reparameterizing <math> {\bf p}_t</math> as <math> {\bf r}_t</math>, back-propagation can flow to <math> {\bf p}_t</math> without being blocked by <math> Q_t</math>, and the approximation can be made arbitrarily close to <math> Q_t</math> by controlling the temperature parameter. The reparameterized distribution is therefore<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf r}_t^i = \frac{\exp((\log {\bf p}_t^i + {g_t}^i)/\tau)}{\sum_j \exp((\log {\bf p}_t^j + {g_t}^j)/\tau)}<br />
</math><br />
</div><br />
<br />
where <math>{g_t}^i</math> is an independent sample from a <math>\text{Gumbel}(0, 1) = -\log(-\log(\text{Uniform}(0, 1)))</math> distribution and <math>\tau</math> is a temperature parameter. Then it can be rewritten that<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t^i<br />
</math><br />
</div><br />
<br />
where <math>{\bf \tilde{h}}_t^i</math> is the candidate hidden state from the earlier equation for <math>{\bf h}_t</math> under decision <math>i</math>. The temperature parameter gradually decreases with time, and <math>{\bf r}_t^i</math> becomes more discrete as <math>\tau</math> approaches 0.<br />
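Drawing a Gumbel-softmax sample is only a few lines of NumPy (a minimal sketch; <code>gumbel_softmax</code> is a hypothetical helper name):<br />

```python
import numpy as np

def gumbel_softmax(p, tau, rng):
    """Differentiable relaxation of sampling from a categorical distribution p.

    Adds independent Gumbel(0, 1) noise to the log-probabilities, then applies
    a softmax with temperature tau; as tau -> 0 the output approaches one-hot.
    """
    g = -np.log(-np.log(rng.uniform(size=p.shape)))  # Gumbel(0, 1) samples
    z = (np.log(p) + g) / tau
    e = np.exp(z - z.max())
    return e / e.sum()
```

At a low temperature the sample is nearly one-hot (approaching the discrete <math>Q_t</math>), while at a high temperature it is nearly uniform.<br />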
<br />
A final addition to the model is to encourage skimming when possible, so an extra term related to the negative log probability of skimming, averaged over the sequence length, is added. The final loss function used for the model is therefore<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L'(\theta) = L(\theta) + \gamma \cdot \frac{1}{T} \sum_t -\log({\bf \tilde{p}}^2_t)<br />
</math><br />
</div><br />
where <math> \gamma </math> is a parameter used to control the ratio between the main loss function and the negative log probability of skimming.<br />
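The skim-encouraging penalty amounts to one extra line of code (a sketch; <code>total_loss</code> and <code>p_skim</code>, the vector of per-step skim probabilities, are hypothetical names):<br />

```python
import numpy as np

def total_loss(main_loss, p_skim, gamma):
    """Task loss plus a penalty that encourages skimming: the negative log
    probability of skimming, averaged over the T time steps, weighted by gamma."""
    return main_loss + gamma * np.mean(-np.log(p_skim))
```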
<br />
== Experiment ==<br />
<br />
The effectiveness of Skim-RNN was measured in terms of accuracy and float-operation reduction on four classification tasks and a question-answering task. These tasks were chosen because they do not require full attention to every detail of the text, but rather call for capturing high-level information (classification) or focusing on a specific portion of the text (QA), which is a common context for speed reading. The tasks themselves are listed in the table below.<br />
<br />
[[File:Table1SkimRNN.png|center|1000px]]<br />
<br />
=== Classification Tasks ===<br />
<br />
In a language classification task, the input is a sequence of words and the output is a vector of categorical probabilities. Each word is embedded into a <math>d</math>-dimensional vector. We initialize the vectors with GloVe [4] to form representations of the words and use those as inputs to a long short-term memory (LSTM) architecture. A linear transformation is applied to the last hidden state of the LSTM, followed by a softmax function, to obtain the classification probabilities. Adam [5] is used for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, <math>\tau = \max(0.5, \exp(-rn))</math> where <math>r = 10^{-4}</math> and <math>n</math> is the global training step, following [2]. We experiment with different sizes of the big LSTM (<math>d \in \{100, 200\}</math>) and small LSTM (<math>d' \in \{5, 10, 20\}</math>) and with the ratio between the model loss and the skim loss (<math>\gamma\in \{0.01, 0.02\}</math>) for Skim-LSTM. The batch sizes used were 32 for SST and Rotten Tomatoes, and 128 for the others. For all models, early stopping was used when the validation accuracy did not increase for 3000 global steps.<br />
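The temperature schedule above is straightforward to reproduce (a sketch; <code>tau_schedule</code> is a hypothetical helper name):<br />

```python
import math

def tau_schedule(step, r=1e-4):
    """Anneal the Gumbel-softmax temperature with the global training step,
    floored at 0.5 so the relaxation never becomes fully discrete: tau = max(0.5, exp(-r*n))."""
    return max(0.5, math.exp(-r * step))
```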
<br />
==== Results ====<br />
<br />
[[File:Table2SkimRNN.png|center|1000px]]<br />
<br />
[[File:Figure2SkimRNN.png|center|1000px]]<br />
<br />
Table 2 shows the accuracy and computational cost of the Skim-RNN model compared with other standard models. It is evident that Skim-RNN reduces the computational cost of the task while maintaining a high degree of accuracy. Interestingly, the accuracy improvement over the LSTM could be due to the increased stability of the hidden state, as the majority of the hidden state is not updated when skimming. Meanwhile, Figure 2 demonstrates the effect of varying the size of the small hidden state as well as the parameter <math>\gamma</math> on accuracy and computational cost.<br />
<br />
[[File:Table3SkimRNN.png|center|1000px]]<br />
<br />
Table 3 shows an example of a classification task on the IMDb dataset, where Skim-RNN with <math>d = 200</math>, <math>d' = 10</math>, and <math>\gamma = 0.01</math> classifies the review correctly with a high skimming rate (92%). The goal is to classify the review as either positive or negative. The black words are skimmed and the blue words are fully read. The skimmed words are clearly irrelevant, and the model learns to carefully read only the important words, such as ‘liked’, ‘dreadful’, and ‘tiresome’.<br />
<br />
=== Question Answering Task ===<br />
<br />
In the Stanford Question Answering Dataset, the task was to locate the answer span for a given question in a context paragraph. The effectiveness of Skim-RNN for SQuAD was evaluated using two different models: LSTM+Attention and BiDAF [6]. The first model was inspired by most then-present QA systems consisting of multiple LSTM layers and an attention mechanism. This type of model is complex enough to reach reasonable accuracy on the dataset and simple enough to run well-controlled analyses for the Skim-RNN. The second model was an open-source model designed for SQuAD, used primarily to show that Skim-RNN could replace RNN in existing complex systems.<br />
<br />
==== Training ==== <br />
<br />
Adam was used with an initial learning rate of 0.0005. For stable training, the model was pre-trained with a standard LSTM for the first 5k steps, and then fine-tuned with Skim-LSTM.<br />
<br />
==== Results ====<br />
<br />
[[File:Table4SkimRNN.png|center|1000px]]<br />
<br />
Table 4 shows the accuracy (F1 and EM) of LSTM+Attention and Skim-LSTM+Attention models as well as VCRNN [7]. It can be observed from the table that the skimming models achieve higher or similar accuracy scores compared to the non-skimming models while also reducing the computational cost by more than 1.4 times. In addition, decreasing layers (1 layer) or hidden size (<math>d=5</math>) improved the computational cost but significantly decreases the accuracy compared to skimming. The table also shows that replacing LSTM with Skim-LSTM in an existing complex model (BiDAF) stably gives reduced computational cost without losing much accuracy (only 0.2% drop from 77.3% of BiDAF to 77.1% of Sk-BiDAF with <math>\gamma = 0.001</math>).<br />
<br />
One explanation offered for this trend is that the model is more confident about which tokens are important in the second layer. In addition, higher <math>\gamma</math> values lead to a higher skimming rate, which agrees with its intended functionality.<br />
<br />
Figure 4 shows the F1 score of the LSTM+Attention model using standard LSTM and Skim-LSTM, sorted in ascending order by Flop-R (computational cost). While models tend to perform better with larger computational cost, Skim-LSTM (red) outperforms standard LSTM (blue) at comparable computational cost. The computational cost of Skim-LSTM is also more stable across different configurations. Moreover, increasing the value of <math>\gamma</math> for Skim-LSTM gradually increases the skimming rate and Flop-R, while also reducing accuracy.<br />
<br />
=== Runtime Benchmark ===<br />
<br />
[[File:Figure6SkimRNN.png|center|1000px]]<br />
<br />
The details of the runtime benchmarks for LSTM and Skim-LSTM, which are used to estimate the speedup of Skim-LSTM-based models in the experiments, are also discussed. A CPU-based benchmark was assumed to be the default benchmark, which has a direct correlation with the number of float operations that can be performed per second. As mentioned previously, the speed-up results in Table 2 (as well as Figure 7) are benchmarked using Python (NumPy), instead of popular frameworks such as TensorFlow or PyTorch.<br />
<br />
Figure 7 shows the relative speed gain of Skim-LSTM compared to standard LSTM with varying hidden state size and skim rate. NumPy was used, with inference run on a single CPU thread. The ratio between the number of float operations (Flop-R) of LSTM and Skim-LSTM was plotted, with this ratio acting as a theoretical upper bound on the speed gain on CPUs. A gap can be noticed between the actual gain and the theoretical gain in speed, with the gap growing as the framework adds more overhead or parallelization. The gap decreases as the hidden state size increases, because the overhead becomes negligible with very large matrix operations; this indicates that Skim-RNN provides greater benefits for RNNs with larger hidden state sizes. Moreover, combining Skim-RNN with a CPU-based framework can lead to substantially lower latency than running a standard RNN on GPUs.<br />
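The theoretical Flop-R upper bound can be estimated with a back-of-the-envelope calculation, under the assumption that a step's cost is dominated by the recurrent matrix multiplication (<math>O(d^2)</math> for a full read, <math>O(d'd)</math> for a skim); <code>flop_reduction</code> is a hypothetical name:<br />

```python
def flop_reduction(d, d_small, skim_rate):
    """Approximate ratio of standard-LSTM float ops to Skim-LSTM float ops,
    assuming per-step cost ~ d*d for a full read and ~ d_small*d for a skim."""
    full = d * d
    skim = d_small * d
    return full / ((1 - skim_rate) * full + skim_rate * skim)
```

For example, with <math>d = 100</math>, <math>d' = 10</math>, and a 90% skim rate, this crude estimate gives roughly a 5.3x theoretical reduction.<br />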
<br />
== Results ==<br />
<br />
The results clearly indicate that the Skim-RNN model provides features that are suitable for general reading tasks, which include classification and question answering. While the tables indicate that minor losses in accuracy occasionally did result when parameters were set at specific values, they were minor and were acceptable given the improvement in runtime.<br />
<br />
'''Controlling skim rate:''' An important advantage of Skim-RNN is that the skim rate (and thus computational cost) can be dynamically controlled at inference time by adjusting the threshold for<br />
‘skim’ decision probability <math>{\bf p}^1_t</math>. Figure 5 shows the trade-off between the accuracy and computational cost for two settings, confirming the importance of skimming (<math>d' > 0</math>) compared to skipping (<math>d' = 0</math>).<br />
<br />
'''Visualization:''' Figure 6 shows that the model does not skim when the input seems to be relevant to answering the question, which was as expected by the design of the model. In addition, the LSTM in the second layer skims more than that in the first layer mainly because the second layer is more confident about the importance of each token.<br />
<br />
== Conclusion ==<br />
<br />
As demonstrated by the results of this study, a Skim-RNN on a CPU can offer lower computational cost and better latency than a standard RNN on a GPU. Compared to a standard RNN, Skim-RNN takes advantage of "skimming" rather than fully "reading", spending less time on parts of the input that are unimportant. Future work (as stated by the authors) involves using Skim-RNN for applications that require a much larger hidden state size, such as video understanding, and using multiple small RNN cells for varying degrees of skimming. Further, since it has the same input and output interface as a regular RNN, it can replace RNNs in existing applications.<br />
<br />
== Critiques ==<br />
<br />
1. Skim-RNN uses a reduced RNN to process words that are not important, so it can increase speed only in particular circumstances (i.e., only small networks). The extra model complexity slows the model down while "optimizing" efficiency, and sacrifices some accuracy in doing so. It targets a very specific setting (classification/question answering) and makes comparisons only against a baseline LSTM model. It would be more persuasive if the model were compared with some state-of-the-art neural network models.<br />
<br />
2. This Skim-RNN model is well suited to binary classification of text, so it would be interesting to apply it to stock-market news analysis. For example, a company press release could be analyzed quickly using this model to immediately give the trader a positive or negative summary of the news. This would be beneficial in trading, since time and speed are important factors when executing a trade.<br />
<br />
3. An appropriate application for Skim-RNN could be customer service chatbots as they can analyze a customer's message and skim associated company policies to craft a response. In this circumstance, quickly analyzing text is ideal to not waste customers' time.<br />
<br />
4. This could be applied to news apps to improve readability by highlighting important sections.<br />
<br />
5. This summary describes an interesting and useful model that can save readers time when reading an article. I think it would be interesting to discuss training a Skim-RNN model to highlight the important sections of very long textbooks. As a student, having highlights in a textbook is really helpful for studying, but highlighting the important parts is time-consuming work for the author; a Skim-RNN might provide a nice model for this job.<br />
<br />
6. Besides the good training performance of Skim-RNN, it is good to see that the algorithm performs well even when trained on a CPU alone, which would make it possible to run on lightweight platforms.<br />
<br />
7. Another good application of Skim-RNN could be reading the terms and conditions of websites. Terms-and-conditions documents tend to be painfully long, and because of their length the majority of customers sign them without reading them. Since these websites can compromise customers' personal information by giving it to third parties, it is worthwhile for customers to know at least what they are signing. In fact, this could also be applied to reading important legal documents such as lease agreements.<br />
<br />
8. Another [https://arxiv.org/abs/1904.00761 paper], written by Christian Hansen et al., also discusses neural speed reading. Rather than conducting neural speed reading via a Skim-RNN, that paper proposes a Structural-Jump-LSTM, which can both skip and jump over text during inference. The model consists of a standard LSTM with two agents: one capable of bypassing single words and another capable of bypassing punctuation.<br />
<br />
9. Another method of increasing the speed of human reading is to flash each of the words in an article in quick succession in the middle of the screen. However, it is important that the relative duration each word spends on the screen is adjusted so that a human reader will feel natural reading it. Could the skimming results from this article be incorporated into the display-one-word-at-a-time method to further increase the speed at which one can read an article? Would adding colour to the unskimmed words also help increase reading speed?<br />
<br />
10. It is interesting how the paper mentions the correlation between a system’s performance on a given reading-based task and its reading speed, a conclusion reached by modelling human reading patterns with neural attention in an unsupervised learning approach.<br />
<br />
11. In the comparison in Figure 7, Skim-LSTM has a definite speed advantage over the original LSTM model, which suggests that the original LSTM models used in everyday applications, such as stock forecasting, could be similarly sped up.<br />
<br />
12. The paper never specifies which type of CPU they tested on (Intel vs AMD). Would be interesting to see which manufacturer's CPU Skim-RNN works better on as each CPU has different FLOP values.<br />
<br />
== Applications ==<br />
<br />
Recurrent architectures are used in many other applications:<br />
<br />
1. '''Real-time video processing''' is an exceedingly demanding and resource-constrained task, particularly in edge settings.<br />
<br />
2. '''Music recommendation systems''', which many music platforms such as Spotify and Pandora use to recommend music based on the user's information.<br />
<br />
3. '''Speech recognition''' enables the recognition and translation of spoken language into text by computers. <br />
<br />
It would be interesting to see if this method could be applied to those cases for more efficient inference, such as on drones or self-driving cars. Another possible application is real-time edge processing of game video for sports arenas.<br />
<br />
== References ==<br />
<br />
[1] Patricia Anderson Carpenter Marcel Adam Just. The Psychology of Reading and Language Comprehension. 1987.<br />
<br />
[2] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.<br />
<br />
[3] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.<br />
<br />
[4] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.<br />
<br />
[5] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
<br />
[6] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017a.<br />
<br />
[7] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.<br />
<br />
[8] Adams Wei Yu, Hongrae Lee, and Quoc V Le. Learning to skim text. In ACL, 2017.<br />
<br />
[9] Eunsol Choi, Daniel Hewlett, Alexandre Lacoste, Illia Polosukhin, Jakob Uszkoreit, and Jonathan Berant. Coarse-to-fine question answering for long documents. In ACL, 2017.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN&diff=49098Neural Speed Reading via Skim-RNN2020-12-04T05:38:11Z<p>Z42qin: /* Training */</p>
<hr />
<div>== Group ==<br />
<br />
Mingyan Dai, Jerry Huang, Daniel Jiang<br />
<br />
== Introduction ==<br />
<br />
Recurrent Neural Network (RNN) is a class of artificial neural networks where the connection between nodes form a directed graph along with time series and has time dynamic behavior. RNN is derived from a feedforward neural network and can use its memory to process variable-length input sequences. In RNNs, outputs can be fed back to the network as inputs and hence creating a recurrent structure to handle varying lengths of data. This makes it suitable for tasks such as unsegmented, connected handwriting recognition, and speech recognition.<br />
<br />
In Natural Language Processing, Recurrent Neural Network (RNN) is a common architecture used to sequentially ‘read’ input tokens and output a distributed representation for each token. By recurrently updating the hidden state of the neural network, an RNN can inherently require the same computational cost across time. However, when it comes to processing input tokens, some tokens are comparatively less important to the overall representation of a piece of text or query. For example, in the application of RNN to the question answering problem, it is not uncommon to encounter parts of a passage that are irrelevant to answering the query.<br />
<br />
LSTM-Jump (Yu et al., 2017), a variant of LSTMs, was introduced to improve efficiency by skipping multiple tokens at a given step. Skip-RNN skips multiple steps and jumps based on RNNs’ current hidden state. In contrast, Skim-RNN takes advantage of 'skimming' rather than 'skipping tokens'. Skim-RNN predicts each word as important or unimportant. A small RNN is used if the word is not important, and a large RNN is used if the word is important. This paper demonstrates that skimming achieves higher accuracy compared to skipping tokens, implying that paying attention to unimportant tokens is better than completely ignoring them.<br />
<br />
== Model ==<br />
<br />
In this paper, the authors introduce a model called 'skim-RNN', which takes advantage of ‘skimming’ less important tokens or pieces of text rather than ‘skipping’ them entirely. This models the human ability to skim through passages, or to spend less time reading parts that do not affect the reader’s main objective. While this leads to a loss in the comprehension rate of the text [1], it greatly reduces the amount of time spent reading by not focusing on areas that will not significantly affect efficiency when it comes to the reader's objective.<br />
<br />
'Skim-RNN' works by rapidly determining the significance of each input and spending less time processing unimportant input tokens by using a smaller RNN to update only a fraction of the hidden state. When the decision is to ‘fully read’, that is to not skim the text, Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function (‘skim’ or ‘read’) is non-differentiable, the authors use a gumbel-softmax [2] to estimate the gradient of the function, rather than traditional methods such as REINFORCE (policy gradient)[3]. Gumbel-softmax provides a differentiable sample compared to the non-differentiable sample of a categorical distribution [2]. The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of float operations (Flop reduction, or Flop-R). A high skimming rate often leads to faster inference on CPUs, which makes it very useful for large-scale products and small devices.<br />
<br />
The Skim-RNN has the same input and output interfaces as standard RNNs, so it can be conveniently used to speed up RNNs in existing models. In addition, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting a parameter for the threshold for the ‘skim’ decision.<br />
<br />
=== Related Works ===<br />
<br />
As the popularity of neural networks has grown, significant attention has been given to making them faster and lighter. In particular, several related works have focused on reducing the computational cost of recurrent neural networks. For example, LSTM-Jump (Yu et al., 2017) [8] aims to speed up run times by skipping certain input tokens, as opposed to skimming them. Choi et al. (2017) [9] proposed a model that uses a CNN-based sentence classifier to determine the sentence(s) most relevant to the question and then applies an RNN-based question-answering model to them. That model focuses on reducing GPU run-time (as opposed to Skim-RNN, which focuses on minimizing CPU time or Flop) and is also restricted to question answering.<br />
<br />
=== Implementation ===<br />
<br />
A Skim-RNN consists of two RNN cells, a default (big) RNN cell of hidden state size <math>d</math> and a small RNN cell of hidden state size <math>d'</math>, where <math>d</math> and <math>d'</math> are parameters defined by the user and <math>d' \ll d</math>. This follows the fact that there should be a small RNN cell defined for when text is meant to be skimmed and a larger one for when the text should be processed as normal.<br />
<br />
Each RNN cell has its own set of weights and biases and may be any variant of an RNN. There is no requirement on how the RNN itself is structured; rather, the core idea is to let the model dynamically decide which cell to use when processing each input token. Note that skipping text can be recovered by setting <math>d'</math> to 0: when an input token is deemed irrelevant to the query or classification task, no information from that token is retained in the model.<br />
<br />
Experimental results suggest that this model is faster than using a single large RNN to process all input tokens, as the smaller RNN requires fewer floating-point operations to process the token. Additionally, higher accuracy and computational efficiency are achieved. <br />
<br />
==== Inference ====<br />
<br />
At each time step <math>t</math>, the Skim-RNN unit takes in an input <math>{\bf x}_t \in \mathbb{R}^d</math> as well as the previous hidden state <math>{\bf h}_{t-1} \in \mathbb{R}^d</math> and outputs the new state <math>{\bf h}_t </math> (although the dimensions of the hidden state and input are the same, this process holds for different sizes as well). In the Skim-RNN, there is a hard decision that needs to be made whether to read or skim the input, although there could be potential to include options for multiple levels of skimming.<br />
<br />
The decision to read or skim is done using a multinomial random variable <math>Q_t</math> over the probability distribution of choices <math>{\bf p}_t</math>, where<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>{\bf p}_t = \text{softmax}(\alpha({\bf x}_t, {\bf h}_{t-1})) = \text{softmax}({\bf W}[{\bf x}_t; {\bf h}_{t-1}]+{\bf b}) \in \mathbb{R}^k</math><br />
</div><br />
<br />
where <math>{\bf W} \in \mathbb{R}^{k \times 2d}</math>, <math>{\bf b} \in \mathbb{R}^{k}</math> are weights to be learned and <math>[{\bf x}_t; {\bf h}_{t-1}] \in \mathbb{R}^{2d}</math> indicates the row concatenation of the two vectors. In this case, <math> \alpha </math> can have any form as long as the complexity of calculating it is less than <math> O(d^2)</math>. Letting <math>{\bf p}^1_t</math> indicate the probability for fully reading and <math>{\bf p}^2_t</math> indicate the probability for skimming the input at time <math> t</math>, it follows that the decision to read or skim can be modelled using a random variable <math> Q_t</math> by sampling from the distribution <math>{\bf p}_t</math> and<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>Q_t \sim \text{Multinomial}({\bf p}_t)</math><br />
</div><br />
<br />
Without loss of generality, we can define <math> Q_t = 1</math> to indicate that the input will be read while <math> Q_t = 2</math> indicates that it will be skimmed. Reading requires applying the full RNN on the input as well as the previous hidden state to modify the entire hidden state while skimming only modifies part of the prior hidden state.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \begin{cases}<br />
f({\bf x}_t, {\bf h}_{t-1}) & Q_t = 1\\<br />
[f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)] & Q_t = 2<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
where <math> f </math> is a full RNN with output of dimension <math>d</math> and <math>f'</math> is a smaller RNN with <math>d'</math>-dimensional output. This has the advantage that when the model decides to skim, the computational complexity of that step is only <math>O(d'd)</math>, which is much smaller than <math>O(d^2)</math> since <math> d' \ll d</math>.<br />
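As a concrete illustration, the two-branch update above can be sketched in NumPy. This is a minimal sketch and not the authors' implementation: it substitutes simple tanh cells for the LSTM variants, uses randomly initialized illustrative weights, and makes the read/skim decision greedily from the probabilities rather than by sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

d, d_small = 8, 2  # big hidden size d and small hidden size d' (d' << d)

# Illustrative parameters: a decision layer (k = 2 choices) and two cells.
W_dec = rng.normal(0, 0.1, (2, 2 * d))          # alpha(x, h) = W[x; h] + b
b_dec = np.zeros(2)
W_big = rng.normal(0, 0.1, (d, 2 * d))          # full RNN cell f
W_small = rng.normal(0, 0.1, (d_small, 2 * d))  # small RNN cell f'

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skim_rnn_step(x, h_prev):
    """One Skim-RNN step: decide to read or skim, then update the state."""
    p = softmax(W_dec @ np.concatenate([x, h_prev]) + b_dec)
    q = 1 if p[0] >= p[1] else 2        # greedy decision (Q_t) at inference
    if q == 1:                          # read: full update of all d dimensions
        h = np.tanh(W_big @ np.concatenate([x, h_prev]))
    else:                               # skim: update only the first d' dims
        h = h_prev.copy()
        h[:d_small] = np.tanh(W_small @ np.concatenate([x, h_prev]))
    return h, q

h = np.zeros(d)
for _ in range(5):                      # process a toy sequence of 5 tokens
    h, q = skim_rnn_step(rng.normal(size=d), h)
```

When the skim branch is taken, only a <math>d' \times 2d</math> matrix product is needed, which is the source of the <math>O(d'd)</math> cost noted above.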
<br />
==== Training ====<br />
<br />
Since the expected loss of the model depends on the sequence of random variables <math> \{Q_t\} </math>, the loss is minimized with respect to the distribution of these variables, where each <math> Q_t </math> is multinomial with probability vector <math>{\bf p}_t</math>. Writing the loss conditioned on a particular sequence of decisions as<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L(\theta\vert Q)<br />
</math><br />
</div><br />
where <math>Q=Q_1\dots Q_T</math> is a sequence of decisions of length <math>T</math>, the expected loss over the distribution of the sequence of decisions is<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
\mathbb{E}[L(\theta)] = \sum_{Q} L(\theta\vert Q)P(Q) = \sum_Q L(\theta\vert Q) \prod_j {\bf p}_j^{Q_j}<br />
</math><br />
</div><br />
<br />
Since calculating <math>\nabla_\theta \mathbb{E}[L(\theta)]</math> directly is infeasible, the gradients can be approximated with a Gumbel-softmax distribution [2]. Reparameterizing <math> {\bf p}_t</math> as <math> {\bf r}_t</math>, back-propagation can flow to <math> {\bf p}_t</math> without being blocked by <math> Q_t</math>, and the approximation can be made arbitrarily close to <math> Q_t</math> by controlling the temperature parameter. The reparameterized distribution is<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf r}_t^i = \frac{\text{exp}((\log{\bf p}_t^i + {g_t}^i)/\tau)}{\sum_j\text{exp}((\log{\bf p}_t^j + {g_t}^j)/\tau)}<br />
</math><br />
</div><br />
<br />
where <math>{g_t}^i</math> is an independent sample from a <math>\text{Gumbel}(0, 1) = -\log(-\log(\text{Uniform}(0, 1)))</math> distribution and <math>\tau</math> is a temperature parameter. Then the hidden state can be rewritten as<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t^i<br />
</math><br />
</div><br />
<br />
where <math>{\bf \tilde{h}}_t^i</math> is the candidate hidden state obtained from the previous equation for <math>{\bf h}_t</math> under choice <math>i</math>. The temperature parameter gradually decreases over training, and <math>{\bf r}_t^i</math> becomes more discrete as <math>\tau</math> approaches 0.<br />
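The reparameterized sampling can be sketched in NumPy as below. This shows only the forward draw; in a deep-learning framework the gradient would flow through the smooth sample <math>{\bf r}_t</math>. The probabilities used are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def gumbel_softmax_sample(p, tau):
    """Differentiable relaxation of a draw from Categorical(p) at temperature tau."""
    g = -np.log(-np.log(rng.uniform(size=np.shape(p))))  # Gumbel(0, 1) noise
    z = (np.log(p) + g) / tau
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = np.array([0.3, 0.7])                      # p_t: read / skim probabilities
r_soft = gumbel_softmax_sample(p, tau=1.0)    # smooth, far from one-hot
r_hard = gumbel_softmax_sample(p, tau=0.01)   # close to a discrete one-hot draw
```

As <math>\tau</math> shrinks, the samples concentrate on a single choice, matching the discrete decision <math>Q_t</math>.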
<br />
A final addition to the model is to encourage skimming when possible, so an extra term related to the negative log probability of skimming, averaged over the sequence length, is added. The final loss function used for the model is therefore<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L'(\theta) = L(\theta) + \gamma \cdot\frac{1}{T} \sum_t -\log({\bf \tilde{p}}^2_t)<br />
</math><br />
</div><br />
where <math> \gamma </math> is a parameter used to control the ratio between the main loss function and the negative log probability of skimming.<br />
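As a small numeric illustration of the loss above (the per-step skim probabilities and <math>\gamma</math> below are hypothetical values, not from the paper):

```python
import numpy as np

def skim_regularized_loss(main_loss, p_skim, gamma):
    """Total loss: main loss plus gamma times the mean of -log p(skim) over T steps."""
    p_skim = np.asarray(p_skim, dtype=float)
    return main_loss + gamma * np.mean(-np.log(p_skim))

# Four-step toy sequence: the penalty is small when skim probabilities are high.
loss = skim_regularized_loss(main_loss=0.5, p_skim=[0.9, 0.8, 0.5, 0.95], gamma=0.01)
```

When every step skims with probability 1, the penalty vanishes and the total loss reduces to the main loss.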
<br />
== Experiment ==<br />
<br />
The effectiveness of Skim-RNN was measured in terms of accuracy and float-operation reduction on four classification tasks and a question-answering task. These tasks were chosen because they do not require full attention to every detail of the text, but rather ask for capturing high-level information (classification) or focusing on a specific portion (QA) of the text, which is a common context for speed reading. The tasks themselves are listed in the table below.<br />
<br />
[[File:Table1SkimRNN.png|center|1000px]]<br />
<br />
=== Classification Tasks ===<br />
<br />
In a language classification task, the input is a sequence of words and the output is a vector of categorical probabilities. Each word is embedded into a <math>d</math>-dimensional vector. We initialize the vectors with GloVe [4] to form representations of the words and use those as the inputs to a long short-term memory (LSTM) architecture. A linear transformation on the last hidden state of the LSTM followed by a softmax function is applied to obtain the classification probabilities. Adam [5] is used for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, <math>\tau = \max(0.5, \exp(-rn))</math> where <math>r = 10^{-4}</math> and <math>n</math> is the global training step, following [2]. We experiment with different sizes of the big LSTM (<math>d \in \{100, 200\}</math>) and the small LSTM (<math>d' \in \{5, 10, 20\}</math>) and with the ratio between the model loss and the skim loss (<math>\gamma\in \{0.01, 0.02\}</math>) for Skim-LSTM. The batch sizes used were 32 for SST and Rotten Tomatoes, and 128 for the others. For all models, early stopping is used when the validation accuracy does not increase for 3000 global steps.<br />
<br />
==== Results ====<br />
<br />
[[File:Table2SkimRNN.png|center|1000px]]<br />
<br />
[[File:Figure2SkimRNN.png|center|1000px]]<br />
<br />
Table 2 shows the accuracy and computational cost of the Skim-RNN model compared with other standard models. The Skim-RNN model clearly speeds up the task while maintaining a high degree of accuracy. Interestingly, the accuracy improvement over LSTM may be due to the increased stability of the hidden state, since the majority of the hidden state is not updated when skimming. Meanwhile, Figure 2 demonstrates the effect of varying the size of the small hidden state as well as the parameter <math>\gamma</math> on accuracy and computational cost.<br />
<br />
[[File:Table3SkimRNN.png|center|1000px]]<br />
<br />
Table 3 shows an example of a classification task on the IMDb dataset, where Skim-RNN with <math>d = 200</math>, <math>d' = 10</math>, and <math>\gamma = 0.01</math> correctly classifies a review with a high skimming rate (92%). The goal is to classify the review as either positive or negative. The black words are skimmed, and the blue words are fully read. The skimmed words are clearly irrelevant, and the model learns to read carefully only the important words, such as ‘liked’, ‘dreadful’, and ‘tiresome’.<br />
<br />
=== Question Answering Task ===<br />
<br />
In the Stanford Question Answering Dataset, the task was to locate the answer span for a given question in a context paragraph. The effectiveness of Skim-RNN for SQuAD was evaluated using two different models: LSTM+Attention and BiDAF [6]. The first model was inspired by most then-present QA systems consisting of multiple LSTM layers and an attention mechanism. This type of model is complex enough to reach reasonable accuracy on the dataset and simple enough to run well-controlled analyses for the Skim-RNN. The second model was an open-source model designed for SQuAD, used primarily to show that Skim-RNN could replace RNN in existing complex systems.<br />
<br />
==== Training ==== <br />
<br />
Adam was used with an initial learning rate of 0.0005. For stable training, the model was pre-trained with a standard LSTM for the first 5k steps, and then fine-tuned with Skim-LSTM.<br />
<br />
==== Results ====<br />
<br />
[[File:Table4SkimRNN.png|center|1000px]]<br />
<br />
Table 4 shows the accuracy (F1 and EM) of the LSTM+Attention and Skim-LSTM+Attention models as well as VCRNN [7]. It can be observed from the table that the skimming models achieve higher or similar accuracy scores compared to the non-skimming models while also reducing the computational cost by more than 1.4 times. In addition, decreasing the number of layers (to 1) or the hidden size (to <math>d=5</math>) improves the computational cost but significantly decreases accuracy compared to skimming. The table also shows that replacing LSTM with Skim-LSTM in an existing complex model (BiDAF) stably reduces computational cost without losing much accuracy (only a 0.2% drop, from 77.3% for BiDAF to 77.1% for Sk-BiDAF with <math>\gamma = 0.001</math>).<br />
<br />
One explanation given for this trend is that the model is more confident about which tokens are important in the second layer. In addition, higher <math>\gamma</math> values lead to a higher skimming rate, which agrees with the intended function of the skim loss.<br />
<br />
Figure 4 shows the F1 score of the LSTM+Attention model using standard LSTM and Skim-LSTM, sorted in ascending order by Flop-R (computational cost). While models tend to perform better with larger computational cost, Skim-LSTM (red) outperforms standard LSTM (blue) at comparable computational cost. The accuracy of Skim-LSTM is also more stable across different configurations and computational costs. Moreover, increasing the value of <math>\gamma</math> for Skim-LSTM gradually increases the skimming rate and Flop-R, while also reducing accuracy.<br />
<br />
=== Runtime Benchmark ===<br />
<br />
[[File:Figure6SkimRNN.png|center|1000px]]<br />
<br />
The runtime benchmarks for LSTM and Skim-LSTM, used to estimate the speedup of Skim-LSTM-based models in the experiments, are also discussed. A CPU-based benchmark was taken as the default, since CPU runtime correlates directly with the number of float operations performed per second. As mentioned previously, the speed-up results in Table 2 (as well as Figure 7) are benchmarked using Python (NumPy) rather than popular frameworks such as TensorFlow or PyTorch.<br />
<br />
Figure 7 shows the relative speed gain of Skim-LSTM compared to standard LSTM with varying hidden state size and skim rate. NumPy was used, with inference run on a single CPU thread. The ratio between the reduction of the number of float operations (Flop-R) of LSTM and Skim-LSTM was plotted, with this ratio acting as a theoretical upper bound on the speed gain on CPUs. There is a gap between the actual and theoretical speed gain, and the gap grows with the overhead of the framework or with more parallelization. The gap also decreases as the hidden state size increases, because the overhead becomes negligible for very large matrix operations. This indicates that Skim-RNN provides greater benefits for RNNs with larger hidden state sizes. Nevertheless, combining Skim-RNN with a CPU-based framework can still achieve substantially lower latency than GPUs.<br />
<br />
== Results ==<br />
<br />
The results clearly indicate that the Skim-RNN model is well suited to general reading tasks, including classification and question answering. While the tables indicate that small losses in accuracy occasionally resulted at specific parameter settings, these were minor and acceptable given the improvement in runtime.<br />
<br />
'''Controlling skim rate:''' An important advantage of Skim-RNN is that the skim rate (and thus computational cost) can be dynamically controlled at inference time by adjusting the threshold for<br />
‘skim’ decision probability <math>{\bf p}^1_t</math>. Figure 5 shows the trade-off between the accuracy and computational cost for two settings, confirming the importance of skimming (<math>d' > 0</math>) compared to skipping (<math>d' = 0</math>).<br />
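The inference-time control described above can be sketched as follows; the per-token read probabilities and thresholds below are illustrative values, not from the paper:

```python
import numpy as np

def skim_decisions(p_read, threshold=0.5):
    """Greedy inference: read (1) when p_read exceeds the threshold, else skim (2)."""
    return np.where(np.asarray(p_read) > threshold, 1, 2)

p_read = np.array([0.9, 0.4, 0.6, 0.2, 0.55])   # p^1_t for five tokens
low = skim_decisions(p_read, threshold=0.3)     # lenient: reads most tokens
high = skim_decisions(p_read, threshold=0.7)    # strict: skims most tokens
```

Raising the threshold raises the skim rate, trading accuracy for computation at inference time without retraining.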
<br />
'''Visualization:''' Figure 6 shows that the model does not skim when the input seems to be relevant to answering the question, which was as expected by the design of the model. In addition, the LSTM in the second layer skims more than that in the first layer mainly because the second layer is more confident about the importance of each token.<br />
<br />
== Conclusion ==<br />
<br />
As demonstrated through the results of this study, a Skim-RNN can offer lower latency on a CPU than a standard RNN on a GPU, at lower computational cost. Compared to a standard RNN, Skim-RNN takes advantage of "skimming" rather than fully "reading", spending less time on unimportant parts of the input. Future work (as stated by the authors) involves using Skim-RNN for applications that require much larger hidden state sizes, such as video understanding, and using multiple small RNN cells for varying degrees of skimming. Further, since it has the same input and output interface as a regular RNN, it can replace RNNs in existing applications.<br />
<br />
== Critiques ==<br />
<br />
1. Skim-RNN uses the smaller RNN to process unimportant words, so it can increase speed only in particular circumstances (i.e., small networks). The extra model complexity slows the model down while "optimizing" efficiency, and some accuracy is sacrificed in the process. The paper targets a very specific setting (classification/question answering) and makes comparisons only with baseline LSTM models; it would be more persuasive if the model were also compared with state-of-the-art neural network models.<br />
<br />
2. The Skim-RNN model is well suited to binary classification of text, so it would be interesting to apply it to stock market news analysis. For example, a press release from a company could be analyzed quickly using this model, immediately giving the trader a positive or negative summary of the news. This would be beneficial in trading, since time and speed are important factors when executing a trade.<br />
<br />
3. An appropriate application for Skim-RNN could be customer service chatbots as they can analyze a customer's message and skim associated company policies to craft a response. In this circumstance, quickly analyzing text is ideal to not waste customers' time.<br />
<br />
4. This could be applied to news apps to improve readability by highlighting important sections.<br />
<br />
5. This summary describes an interesting and useful model that can save readers time when reading an article. It would be interesting to discuss training a Skim-RNN model to highlight the important sections in very long textbooks. As a student, having highlights in a textbook is really helpful for studying, but highlighting the important parts is time-consuming work for the author; a Skim-RNN might provide a good model to do this job.<br />
<br />
6. Besides the good training performance of Skim-RNN, it is good to see that the algorithm performs well even when run on a CPU. This makes it possible to deploy the model on lightweight platforms.<br />
<br />
7. Another good application of Skim-RNN could be reading the terms and conditions of websites. Many terms-and-conditions documents are painfully long, and because of their length the majority of customers sign them without reading. Since these websites can compromise customers' personal information by giving it to third parties, it is worthwhile for customers to know at least what they are signing. In fact, this could also be applied to important legal documents such as lease agreements.<br />
<br />
8. Another [https://arxiv.org/abs/1904.00761 paper], written by Christian Hansen et al., also discusses neural speed reading. However, instead of conducting neural speed reading via a Skim-RNN, that paper proposes a Structural-Jump-LSTM, which can both skip and jump over text during inference. The model consists of a standard LSTM with two agents: one capable of skipping single words and another capable of jumping past punctuation.<br />
<br />
9. Another method of increasing the speed of human reading is to flash each word of an article in quick succession in the middle of the screen. However, it is important that the relative duration each word spends on the screen is adjusted so that a human reader feels natural reading it. Could the skimming results from this article be incorporated into the display-one-word-at-a-time method to further increase the speed at which one can read an article? Would adding color to the unskimmed words also help increase reading speed?<br />
<br />
10. It is interesting how the paper mentions the correlation between a system's performance on a given reading-based task and its speed of reading, a connection drawn from modeling human reading patterns with neural attention in an unsupervised learning approach.<br />
<br />
11. In the comparison in Figure 7, Skim-LSTM has a definite speed advantage over the original LSTM model, which suggests that the LSTM models used in everyday applications, such as stock forecasting, could be sped up in the same way.<br />
<br />
12. The paper never specifies which type of CPU was used for testing (Intel vs. AMD). It would be interesting to see on which manufacturer's CPU Skim-RNN works better, as different CPUs achieve different FLOP rates.<br />
<br />
== Applications ==<br />
<br />
Recurrent architectures are used in many other applications:<br />
<br />
1. '''Real-time video processing''' is an exceedingly demanding and resource-constrained task, particularly in edge settings.<br />
<br />
2. '''Music recommendation systems''', which many music platforms such as Spotify and Pandora use to recommend music based on a user's information.<br />
<br />
3. '''Speech recognition''' enables the recognition and translation of spoken language into text by computers. <br />
<br />
It would be interesting to see if this method could be applied to those cases for more efficient inference, such as on drones or self-driving cars. Another possible application is real-time edge processing of game video for sports arenas.<br />
<br />
== References ==<br />
<br />
[1] Patricia Anderson Carpenter Marcel Adam Just. The Psychology of Reading and Language Comprehension. 1987.<br />
<br />
[2] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.<br />
<br />
[3] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.<br />
<br />
[4] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.<br />
<br />
[5] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
<br />
[6] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017a.<br />
<br />
[7] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.<br />
<br />
[8] Adams Wei Yu, Hongrae Lee, and Quoc V Le. Learning to skim text. In ACL, 2017.<br />
<br />
[9] Eunsol Choi, Daniel Hewlett, Alexandre Lacoste, Illia Polosukhin, Jakob Uszkoreit, and Jonathan Berant. Coarse-to-fine question answering for long documents. In ACL, 2017.</div>
<hr />
<div><br />
== Introduction ==<br />
<br />
Much of the success in training deep neural networks (DNNs) is due to the collection of large datasets with human-annotated labels. However, human annotation is both a time-consuming and expensive task, especially for data that requires expertise such as medical data. Furthermore, certain datasets will be noisy due to the biases introduced by different annotators. Data obtained in large quantities through searching for images in search engines and data downloaded from social media sites (in a manner abiding by privacy and copyright laws) are especially noisy, since the labels are generally inferred from tags to save on human-annotation cost. <br />
<br />
There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. Other LNL methods estimate a noise transition matrix and employ it to correct the loss function. An example of a popular loss correction approach is the bootstrapping loss approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples. The main limitation of these methods is that they do not perform well under high noise ratio and cause overfitting.<br />
<br />
This paper introduces DivideMix, which combines approaches from LNL and SSL. One unique thing about DivideMix is that it discards sample labels that are highly likely to be noisy and leverages these noisy samples as unlabeled data instead. This prevents the model from overfitting and improves generalization performance. Key contributions of this work are:<br />
<ol><br />
<li> Co-divide, which trains two networks simultaneously, aims to improve generalization and avoid confirmation bias.<br />
<li> During the SSL phase, an improvement is made on an existing method (MixMatch) by combining it with another method (MixUp).<br />
<li> Significant improvements to state-of-the-art results under multiple conditions are experimentally demonstrated using DivideMix. An extensive ablation study and qualitative results are also shown to examine the effect of different components.<br />
</ol><br />
<br />
== Motivation ==<br />
<br />
While much has been achieved in training DNNs with noisy labels and SSL methods individually, not much progress has been made in exploring their underlying connections and building on top of the two approaches simultaneously. <br />
<br />
Existing LNL methods aim to correct the loss function by:<br />
<ol><br />
<li> Treating all samples equally and correcting loss explicitly or implicitly through relabelling of the noisy samples<br />
<li> Reweighting training samples or separating clean and noisy samples, which results in correction of the loss function<br />
</ol><br />
<br />
A few examples of LNL methods include:<br />
<ol><br />
<li> Estimating the noise transition matrix, which denotes the probability of clean labels flipping to noisy labels, to correct the loss function<br />
<li> Leveraging the predictions from DNNs to correct labels and using them to modify the loss<br />
<li> Reweighting samples so that noisy labels contribute less to the loss<br />
</ol><br />
<br />
However, these methods all have downsides: it is very challenging to correctly estimate the noise transition matrix in the first method; for the second method, DNNs tend to overfit to datasets with high noise ratio; and for the third method, we need to be able to identify clean samples, which has also proven to be challenging.<br />
<br />
On the other hand, SSL methods mostly leverage unlabeled data using regularization to improve model performance. A recently proposed method, MixMatch, incorporates two classes of regularization: consistency regularization, which enforces the model to produce consistent predictions on augmented input data, and entropy minimization, which encourages the model to give high-confidence predictions on unlabeled data. It also employs MixUp regularization. <br />
<br />
DivideMix partially adopts LNL in that it removes the labels that are highly likely to be noisy by using co-divide to avoid the confirmation bias problem. It then utilizes the noisy samples as unlabeled data and adopts an improved version of MixMatch (an SSL technique) which accounts for the label noise during the label co-refinement and co-guessing phase. By incorporating SSL techniques into LNL and taking the best of both worlds, DivideMix aims to produce highly promising results in training DNNs by better addressing the confirmation bias problem, more accurately distinguishing and utilizing noisy samples, and performing well under high levels of noise.<br />
<br />
== Model Architecture and Algorithm ==<br />
<br />
DivideMix leverages semi-supervised learning to achieve effective modeling. The sample is first split into a labeled set and an unlabeled set. This is achieved by fitting a Gaussian mixture model to the per-sample loss distribution. The unlabeled set is made up of data points whose labels were discarded as likely noisy. <br />
<br />
Then, to avoid the confirmation bias that is typical when a model is self-trained, two models are trained simultaneously to filter errors for each other. This is done by dividing the data into a clean labeled set (X) and a noisy unlabeled set (U) using one model, and then training the other model on these sets. This algorithm, known as co-divide, keeps the two networks from converging to each other during training, which avoids the bias. A Gaussian mixture model is better at distinguishing X and U, whereas a beta mixture model produces a flat distribution and fails to label correctly. Being divergent also gives the two networks distinct abilities to filter different types of error, making the model more robust to noise. However, the model could still suffer confirmation error, where both models are prone to making and confirming the same mistake. Figure 1 describes the algorithm in graphical form.<br />
<br />
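A minimal sketch of the co-divide split: fit a two-component one-dimensional Gaussian mixture to per-sample losses and keep the posterior of the low-loss ("clean") component as each sample's weight. The EM loop is hand-rolled here to stay self-contained (a library implementation would normally be used), and the losses are synthetic, illustrative data.

```python
import numpy as np

def fit_gmm_1d(losses, iters=50):
    """EM for a 2-component 1-D Gaussian mixture; returns posterior of component 0."""
    x = np.asarray(losses, dtype=float)
    mu = np.array([x.min(), x.max()])           # init: low-loss vs high-loss mode
    var = np.array([x.var() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means, and variances
        n = resp.sum(axis=0)
        pi = n / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
    return resp[:, 0]   # w_i: probability the sample belongs to the clean mode

rng = np.random.default_rng(2)
losses = np.concatenate([rng.normal(0.2, 0.05, 80),    # clean, low-loss samples
                         rng.normal(1.5, 0.30, 20)])   # noisy, high-loss samples
w = fit_gmm_1d(losses)
clean_idx = np.where(w > 0.5)[0]    # labeled set X
noisy_idx = np.where(w <= 0.5)[0]   # unlabeled set U (labels discarded)
```

Samples whose posterior falls below the threshold lose their labels and are treated as unlabeled data in the SSL phase.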
[[File:ModelArchitecture.PNG | center]]<br />
<br />
<div align="center">Figure 1: Model Architecture of DivideMix</div><br />
<br />
For each epoch, the network divides the dataset into a labeled set consisting of clean data and an unlabeled set consisting of noisy data, which is then used as training data for the other network, with training done in mini-batches. For each batch of labeled samples, co-refinement is performed using the ground-truth label <math> y_b </math>, the predicted label <math> p_b </math>, and the posterior probability as the weight <math> w_b </math>: <br />
<br />
<center><math> \bar{y}_b = w_b y_b + (1-w_b) p_b </math></center> <br />
<br />
Then, a sharpening function is implemented on this weighted sum to produce the estimate with reduced temperature, <math> \hat{y}_b </math>. <br />
<br />
<center><math> \hat{y}_b=Sharpen(\bar{y}_b,T)={\bar{y}^{c{\frac{1}{T}}}_b}/{\sum_{c=1}^C\bar{y}^{c{\frac{1}{T}}}_b} </math>, for <math>c = 1, 2,..,C</math></center><br />
<br />
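The co-refinement and sharpening steps can be sketched together; the label, prediction, weight, and temperature below are illustrative values, not from the paper.

```python
import numpy as np

def co_refine(y, p, w):
    """Label co-refinement: convex combination of ground-truth and predicted label."""
    return w * np.asarray(y) + (1 - w) * np.asarray(p)

def sharpen(y_bar, T):
    """Temperature sharpening: raise each class probability to 1/T, renormalize."""
    y = np.asarray(y_bar) ** (1.0 / T)
    return y / y.sum()

y_b = np.array([1.0, 0.0, 0.0])    # one-hot (possibly noisy) ground-truth label
p_b = np.array([0.2, 0.7, 0.1])    # the network's predicted class probabilities
y_hat = sharpen(co_refine(y_b, p_b, w=0.3), T=0.5)   # low w: label deemed noisy
```

With T &lt; 1, sharpening pushes mass toward the dominant class, lowering the entropy of the refined label.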
Using all these predicted labels, the unlabeled samples will then be assigned a "co-guessed" label, which should produce a more accurate prediction. Having calculated all these labels, MixMatch is applied to the combined mini-batch of labeled, <math> \hat{X} </math> and unlabeled data, <math> \hat{U} </math>, where, for a pair of samples and their labels, one new sample and new label is produced. More specifically, for a pair of samples <math> (x_1,x_2) </math> and their labels <math> (p_1,p_2) </math>, the mixed sample <math> (x',p') </math> is:<br />
<br />
<center><br />
<math><br />
\begin{alignat}{2}<br />
<br />
\lambda &\sim Beta(\alpha, \alpha) \\<br />
\lambda ' &= max(\lambda, 1 - \lambda) \\<br />
x' &= \lambda ' x_1 + (1 - \lambda ' ) x_2 \\<br />
p' &= \lambda ' p_1 + (1 - \lambda ' ) p_2 \\<br />
<br />
\end{alignat}<br />
</math><br />
</center> <br />
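The mixing equations above can be sketched in plain Python, treating samples as flat vectors; this is a simplified illustration under that assumption, not the authors' implementation:<br />

```python
import random

def mixmatch_mix(x1, p1, x2, p2, alpha=4.0):
    """Mix a pair of samples and labels as in MixMatch.

    Taking lam' = max(lam, 1 - lam) keeps the mixed sample closer to (x1, p1)."""
    lam = random.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    p_mix = [lam * a + (1.0 - lam) * b for a, b in zip(p1, p2)]
    return x_mix, p_mix

# Mixing two one-hot-labeled samples yields a convex combination of both.
x_mix, p_mix = mixmatch_mix([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```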
<br />
MixMatch transforms <math> \hat{X} </math> and <math> \hat{U} </math> into <math> X' </math> and <math> U' </math>. Then, the loss on <math> X' </math>, <math> L_X </math> (Cross-entropy loss) and the loss on <math> U' </math>, <math> L_U </math> (Mean Squared Error) are calculated. A regularization term, <math> L_{reg} </math>, is introduced to regularize the model's average output across all samples in the mini-batch. Then, the total loss is calculated as:<br />
<br />
<center><math> L = L_X + \lambda_u L_U + \lambda_r L_{reg} </math></center> <br />
<br />
where <math> \lambda_r </math> is set to 1, and <math> \lambda_u </math> is used to control the unsupervised loss.<br />
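As a sketch, the loss terms could be combined as below (plain Python with hypothetical per-batch values; the real implementation operates on tensors and computes <math> L_{reg} </math> from the mini-batch average output):<br />

```python
import math

def cross_entropy(p_true, p_pred):
    """Cross-entropy loss for one sample, used on the labeled batch X'."""
    return -sum(t * math.log(q) for t, q in zip(p_true, p_pred) if t > 0)

def mse(p_true, p_pred):
    """Mean squared error, used on the unlabeled batch U'."""
    return sum((t - q) ** 2 for t, q in zip(p_true, p_pred)) / len(p_true)

def total_loss(L_X, L_U, L_reg, lambda_u, lambda_r=1.0):
    """L = L_X + lambda_u * L_U + lambda_r * L_reg."""
    return L_X + lambda_u * L_U + lambda_r * L_reg

# Hypothetical per-batch values for illustration only.
L_X = cross_entropy([1.0, 0.0], [0.9, 0.1])
L_U = mse([0.8, 0.2], [0.6, 0.4])
L = total_loss(L_X, L_U, L_reg=0.05, lambda_u=25.0)
```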
<br />
Lastly, the stochastic gradient descent formula is updated with the calculated loss, <math> L </math>, and the estimated parameters, <math> \boldsymbol{ \theta } </math>.<br />
<br />
The full algorithm is shown below. [[File:dividemix.jpg|600px| | center]]<br />
<div align="center">Algorithm 1: DivideMix. Lines 4-8: co-divide; lines 17-18: label co-refinement; line 20: co-guessing.</div><br />
<br />
When the model is warmed up, it is trained on all data using standard cross-entropy to initially converge the model, but with an added negative-entropy regularization term <math>\mathcal{H} = -\sum_{c}\text{p}^\text{c}_\text{model}(x;\theta)\log(\text{p}^\text{c}_\text{model}(x;\theta))</math>, where <math>\text{p}^\text{c}_\text{model}</math> is the softmax output probability for class c. This term penalizes confident predictions during the warm-up to prevent overfitting to noise, which can happen when there is asymmetric noise.<br />
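To illustrate why this term penalizes confident predictions, here is a small plain-Python sketch (the probability vectors are mocked, not real model outputs):<br />

```python
import math

def entropy(probs):
    """Shannon entropy H(p) = -sum_c p_c * log(p_c); low for confident predictions."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A confident prediction has low entropy, so adding the negative entropy -H to
# the loss penalizes it more than an uncertain (high-entropy) prediction.
confident = entropy([0.99, 0.005, 0.005])
uncertain = entropy([1 / 3, 1 / 3, 1 / 3])
```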
<br />
== Results ==<br />
'''Applications'''<br />
<br />
The method was validated using four benchmark datasets: CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), which contain 50K training images and 10K test images of size 32 × 32; Clothing1M (Xiao et al., 2015); and WebVision (Li et al., 2017a).<br />
Two types of label noise are used in the experiments: symmetric and asymmetric.<br />
An 18-layer PreAct Resnet (He et al., 2016) is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 300 epochs. The initial learning rate was set to 0.02 and reduced by a factor of 10 after 150 epochs. Before applying the Co-divide and MixMatch strategies, the models were first independently trained over the entire dataset using cross-entropy loss during a "warm-up" period. Initially, training the models in this way prepares a more regular distribution of losses to improve upon in subsequent epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100. For all CIFAR experiments, we use the same hyperparameters M = 2, T = 0.5, and α = 4. τ is set as 0.5 except for 90% noise ratio when it is set as 0.6.<br />
<br />
<br />
'''Comparison of State-of-the-Art Methods'''<br />
<br />
The effectiveness of DivideMix was shown by comparing the test accuracy with the most recent state-of-the-art methods: <br />
Meta-Learning (Li et al., 2019) proposes a gradient-based method to find model parameters that are more noise-tolerant; <br />
Joint-Optim (Tanaka et al., 2018) and P-correction (Yi & Wu, 2019) jointly optimize the sample labels and the network parameters;<br />
M-correction (Arazo et al., 2019) models sample loss with a BMM and applies MixUp.<br />
The following are the results on CIFAR-10 and CIFAR-100 with different levels of symmetric label noise ranging from 20% to 90%. Both the best test accuracy across all epochs and the averaged test accuracy over the last 10 epochs were recorded in the following table:<br />
<br />
<br />
[[File:divideMixtable1.PNG | center]]<br />
<br />
From Table 1, the authors noticed that none of these methods can consistently outperform the others across different datasets. M-correction excels at symmetric noise, whereas Meta-Learning performs better for asymmetric noise. DivideMix outperforms the state-of-the-art methods by a large margin across all noise ratios. The improvement is substantial (∼10% in accuracy) on the more challenging CIFAR-100 with high noise ratios.<br />
<br />
DivideMix was compared with the state-of-the-art methods with the other two datasets: Clothing1M and WebVision. It also shows that DivideMix consistently outperforms state-of-the-art methods across all datasets with different types of label noise. For WebVision, DivideMix achieves more than 12% improvement in top-1 accuracy. <br />
<br />
<br />
'''Ablation Study'''<br />
<br />
The authors examine the effect of removing different components to provide insights into what makes DivideMix successful. The results in Table 5 are analyzed as follows.<br />
<br />
<br />
[[File:DivideMixtable5.PNG | center]]<br />
<br />
The authors combined self-divide with the original MixMatch as a naive baseline for using SSL in LNL.<br />
They also find that both label refinement and input augmentation are beneficial for DivideMix. ''Label refinement'' is important for high noise ratios because noisier samples would be incorrectly divided into the labeled set. ''Augmentation'' upgrades model performance by creating more reliable predictions and by achieving consistency regularization. In addition, the performance drop seen in ''DivideMix w/o co-training'' highlights the disadvantage of self-training; the model still has dataset division, label refinement and label guessing, but they are all performed by the same model.<br />
<br />
== Conclusion ==<br />
<br />
This paper provides a new and effective algorithm for learning with noisy labels by treating highly noisy samples as unlabelled data in a Semi-Supervised Learning framework. The DivideMix method trains two networks simultaneously and utilizes co-refinement and co-guessing effectively, making it a robust approach for dealing with noise in datasets. The DivideMix method has also been tested on various datasets, with results that are consistently among the best when compared to state-of-the-art methods through extensive experiments.<br />
<br />
Future work of DivideMix is to create an adaptation for other applications such as Natural Language Processing, and incorporating the ideas of SSL and LNL into DivideMix architecture.<br />
<br />
== Critiques/ Insights ==<br />
<br />
1. While combining both models makes the result better, the authors did not show the relative time increase of this new combined methodology, which is crucial when training on a large amount of data, especially images. In addition, it seems that the authors did not perform much hyperparameter tuning for the combined model.<br />
<br />
2. There is an interesting insight, which is when the noise ratio increases from 80% to 90%, the accuracy of DivideMix drops dramatically in both datasets.<br />
<br />
3. There should be a further explanation of why the learning rate drops by a factor of 10 after 150 epochs.<br />
<br />
4. It would be interesting to see the effectiveness of this method in other domains such as NLP. I am not aware of noisy training datasets available in NLP, but surely this is an important area to focus on, as much of the available data is collected from noisy sources from the web.<br />
<br />
5. The paper implicitly assumes that a Gaussian mixture model (GMM) is sufficiently capable of identifying noise. Given the nature of a GMM, it would work well for noise that is distributed by a Gaussian distribution but for all other noise, it would probably be only asymptotic. The paper should present theoretical results on the noise that are Exponential, Rayleigh, etc. This is particularly important because the experiments were done on massive datasets, but they do not directly address the case when there are not many data points. <br />
<br />
6. Comparing the training result on these benchmark datasets makes the algorithm quite comprehensive. This is a very insightful idea to maintain two networks to avoid bias from occurring.<br />
<br />
7. The current benchmark accuracy for CIFAR-10 is 99.7, CIFAR-100 is 96.08 using EffNet-L2 in 2020. In 2019, CIFAR-10 is 99.37, CIFAR-100 is 93.51 using BiT-L.(based on paperswithcode.com) As there exists better methods, it would be nice to know why the authors chose these state-of-the-art methods to compare the test accuracy.<br />
<br />
8. Another interesting observation is that DivideMix seems to maintain a similar accuracy while some methods give unstable results. That shows the reliability of the proposed algorithm.<br />
<br />
9. It would be interesting to see if the drop in accuracy from increasing the noise ratio to 90% is a result of a low proportion or low number of clean labels. That is, would increasing the size of the training set but keeping the noise ratio at 90% result in increased accuracy?<br />
<br />
10. For Ablation Study part, the paper also introduced a study on the Robustness of Testing Marking Methods Noise, including AUC for classification of clean/noisy samples of CIFAR-10 training data. And it shows that the method can effectively separate clean and noisy samples as training proceeds.<br />
<br />
11. It is interesting how unlike common methods, the method in this paper discards the labels that are highly likely to be<br />
noisy. It also utilizes the noisy samples as unlabeled data to regularize training in a SSL manner. This model can better distinguish and utilize noisy samples.<br />
<br />
12. In the result section, the author gives us a comprehensive understanding of this algorithm by introducing its applications and comparing it with similar methods. It would be attractive if, in the application part, the author could indicate how the applications relate to our daily life.<br />
<br />
13. High quality data is very important for training Machine learning systems. Preparing the data to train ML systems requires data annotations which are prone to errors and are time-consuming. It is interesting to note how paper 14 and this paper aims to approach this problem from different perspectives. Paper 14 introduces CSL algorithm that learns from confused or Noisy data to find the tasks associated with them. And this paper proposes an algorithm that shows good performance when learning from noisy data. Hence both the papers seem to tackle similar problem and implementing the approaches described in both the papers when handling noisy data can be twice helpful.<br />
<br />
14. Noise exists in all big data, and big data is what we are dealing with in real life nowadays. Having an effective noise eliminating method such as Dividemix is important to us.<br />
<br />
15. The DivideMix consistently outperforms state-of-the-art methods across the given datasets, but how about some other potential datasets? If it can be given that it has advantages for a certain type of potential dataset, it will be a better discussion.<br />
<br />
== References ==<br />
[1] Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Unsupervised<br />
label noise modeling and loss correction. In ICML, pp. 312–321, 2019.<br />
<br />
[2] David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin<br />
Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019.<br />
<br />
[3] Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach<br />
to learning from noisy labels. In WACV, pp. 1215–1224, 2018.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Music_Recommender_System_Based_using_CRNN&diff=49072Music Recommender System Based using CRNN2020-12-04T04:00:31Z<p>Z42qin: /* Modeling Neural Networks */</p>
<hr />
<div>==Introduction and Objective:==<br />
<br />
In the digital era of music streaming, companies such as Spotify and Pandora are faced with the following challenge: can they provide users with relevant and personalized music recommendations amidst the ever-growing abundance of music and user data?<br />
<br />
The objective of this paper is to implement a personalized music recommender system that takes user listening history as input and continually finds new music that captures individual user preferences.<br />
<br />
This paper argues that a music recommendation system should vary from the general recommendation system used in practice since it should combine music feature recognition and audio processing technologies to extract music features, and combine them with data on user preferences.<br />
<br />
The authors of this paper took a content-based music approach to build the recommendation system - specifically, comparing the similarity of features based on the audio signal.<br />
<br />
The following two-method approach for building the recommendation system was followed:<br />
#Make recommendations including genre information extracted from classification algorithms.<br />
#Make recommendations without genre information.<br />
<br />
The authors used convolutional recurrent neural networks (CRNN), which is a combination of convolutional neural networks (CNN) and recurrent neural network(RNN), as their main classification model.<br />
<br />
==Methods and Techniques:==<br />
The collaborative filtering method is used by many music recommendation systems; it is explicitly based on users' ratings of songs to find similar users or items. However, people tend not to rate songs while listening to music. Nowadays, deep learning techniques are employed in music recommendation systems. Content-based recommendation systems use item content to find similarities between items and recommend items to users. <br />
<br />
Generally, a music recommender can be divided into three main parts: (i) users, (ii) items, and (iii) user-item matching algorithms. First, users' music tastes are generated based on their profiles. Second, item profiling includes editorial, cultural, and acoustic metadata collected for listeners' satisfaction. Finally, the matching algorithm suggests personalized music recommendations to listeners. <br />
<br />
To classify music, the original music’s audio signal is converted into a spectrogram image. Using the image and the Short Time Fourier Transform (STFT), we convert the data into the Mel scale which is used in the CNN and CRNN models. <br />
=== Mel Scale: === <br />
A perceptual scale of pitches judged by listeners to be equal in distance from one another, so that equal steps on the scale correspond to equal perceived pitch increments.<br />
<br />
[[File:Mel.png|frame|none|Mel Scale on Spectrogram]]<br />
<br />
=== Short Time Fourier Transform (STFT): ===<br />
The transformation that determines the sinusoidal frequency content of local sections of the audio signal, here with a Hanning smoothing (window) function. In the continuous case this is written as: <math>\mathbf{STFT}\{x(t)\}(\tau,\omega) \equiv X(\tau, \omega) = \int_{-\infty}^{\infty} x(t) w(t-\tau) e^{-i \omega t} \, d t </math><br />
<br />
where: <math>w(\tau)</math> is the Hanning smoothing function. The STFT is applied over a specified window length at a certain time allowing the frequency to represented for that given window rather than the entire signal as a typical Fourier Transform would.<br />
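To make the windowed-transform idea concrete, here is a tiny discrete STFT sketch in plain Python with a Hann window (the window length, hop size, and test signal are arbitrary choices for illustration; a real system would use an optimized FFT library):<br />

```python
import cmath
import math

def hann(n):
    """Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def stft(x, win_len=16, hop=8):
    """Return a list of frames; each frame is the DFT of a Hann-windowed segment."""
    w = hann(win_len)
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        seg = [x[start + k] * w[k] for k in range(win_len)]
        frame = [sum(seg[k] * cmath.exp(-2j * math.pi * f * k / win_len)
                     for k in range(win_len))
                 for f in range(win_len // 2 + 1)]
        frames.append(frame)
    return frames

# A pure tone at 2 cycles per window concentrates its energy near frequency bin 2.
signal = [math.sin(2 * math.pi * 2 * t / 16) for t in range(64)]
spec = stft(signal)
```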
<br />
=== Convolutional Neural Network (CNN): ===<br />
A Convolutional Neural Network is a Neural Network that uses convolution in place of matrix multiplication for some layer calculations. By training the data, weights for inputs are updated to find the most significant data relevant to classification. These convolutional layers gather small groups of data with kernels and try to find patterns that can help find features in the overall data. The features are then used for classification. Padding is another technique used to extend the pixels on the edge of the original image to allow the kernel to more accurately capture the borderline pixels. Padding is also used if one wishes the convolved output image to have a certain size. The image on the left represents the mathematical expression of a convolution operation, while the right image demonstrates an application of a kernel on the data.<br />
<br />
[[File:Convolution.png|thumb|400px|left|Convolution Operation]]<br />
[[File:PaddingKernels.png|thumb|400px|center|Example of Padding (white 0s) and Kernels (blue square)]]<br />
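The padding-and-kernel mechanics above can be sketched in plain Python, using "same" zero padding so the output keeps the input size (as in most CNN libraries, this computes cross-correlation rather than a flipped-kernel convolution):<br />

```python
def conv2d_same(image, kernel):
    """2D convolution (cross-correlation) with zero padding so the output
    has the same height and width as the input."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2  # padding needed on each side
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    ii, jj = i + a - ph, j + b - pw
                    if 0 <= ii < h and 0 <= jj < w:  # out-of-bounds acts as zero
                        s += image[ii][jj] * kernel[a][b]
            out[i][j] = s
    return out

# The identity kernel leaves the image unchanged, and padding preserves its size.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
img = [[1.0, 2.0], [3.0, 4.0]]
out = conv2d_same(img, identity)
```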
<br />
=== Convolutional Recurrent Neural Network (CRNN): === <br />
The CRNN is similar to the architecture of a CNN, but with the addition of a GRU, which is a form of Recurrent Neural Network (RNN). An RNN is used to treat sequential data by reusing the activations of previous nodes to update the output. A Gated Recurrent Unit (GRU) is used to store more long-term memory and helps train the early hidden layers. GRUs can be thought of as simplified LSTMs that merge the forget and input gates into a single update gate, and they have fewer parameters than an LSTM. These gates are used to determine how much information from the past should be passed along into the future. They were originally designed to mitigate the vanishing gradient problem, since deeper networks result in smaller and smaller gradients at each layer. The GRU can choose to copy over all the information from the past, thus eliminating the risk of vanishing gradients.<br />
<br />
[[File:GRU441.png|thumb|400px|left|Gated Recurrent Unit (GRU)]]<br />
[[File:Recurrent441.png|thumb|400px|center|Diagram of General Recurrent Neural Network]]<br />
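A single GRU update can be sketched for the scalar case in plain Python (the weights below are arbitrary illustrative numbers, not trained values):<br />

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: the update gate z interpolates between the old state h
    and the candidate state h_tilde; the reset gate r scales the old state."""
    z = sigmoid(Wz * x + Uz * h)                 # update gate
    r = sigmoid(Wr * x + Ur * h)                 # reset gate
    h_tilde = math.tanh(Wh * x + Uh * (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde

# Run a short input sequence through the cell, carrying the hidden state along.
h = 0.0
for x in [1.0, -0.5, 0.3]:
    h = gru_cell(x, h, 0.8, 0.1, 0.5, -0.2, 1.0, 0.7)
```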
<br />
==Data Screening:==<br />
<br />
The authors of this paper used a publicly available music dataset made up of 25,000 30-second songs from the Free Music Archives which contains 16 different genres. The data is cleaned up by removing low audio quality songs, wrongly labelled genres and those that have multiple genres. To ensure a balanced dataset, only 1000 songs each from the genres of classical, electronic, folk, hip-hop, instrumental, jazz and rock were used in the final model. <br />
<br />
[[File:Data441.png|thumb|200px|none|Data sorted by music genre]]<br />
<br />
==Implementation:==<br />
<br />
=== Modeling Neural Networks ===<br />
<br />
As noted previously, both CNNs and CRNNs were used to model the data. The advantage of CRNNs is that they are able to model time sequence patterns in addition to frequency features from the spectrogram, allowing for greater identification of important features. Furthermore, feature vectors produced before the classification stage could be used to improve accuracy. <br />
<br />
In implementing the neural networks, the Mel-spectrogram data was split up into training, validation, and test sets at a ratio of 8:1:1 respectively and labelled via one-hot encoding. This made it possible for the categorical data to be labelled correctly for binary classification. As opposed to classical stochastic gradient descent, the authors opted to use ADAM optimization to update weights in the training phase, with parameters <math>\alpha = 0.001, \beta_1 = 0.9, \beta_2 = 0.999</math>. Binary cross-entropy was used as the loss function. <br />
Input spectrogram images are 96x1366. In both the CNN and CRNN models, the data was trained over 100 epochs with a batch size of 50 (limited computing power) and a binary cross-entropy loss function. Notable model-specific details are below:<br />
<br />
'''CNN'''<br />
* Five convolutional layers with 3x3 kernel, stride 1, padding, batch normalization, and ReLU activation<br />
* Max pooling layers <br />
* The sigmoid function was used as the output layer<br />
<br />
'''CRNN'''<br />
* Four convolutional layers with 3x3 kernel (which construct a 2D temporal pattern - two layers of RNNs with Gated Recurrent Units), stride 1, padding, batch normalization, ReLU activation, and dropout rate 0.1<br />
* Feature maps of size Nx1x15 (N = number of feature maps; 68 feature maps in this case) are used as input to the RNNs.<br />
* 4 Max pooling layers for four convolutional layers with kernel ((2x2)-(3x3)-(4x4)-(4x4)) and same stride<br />
* The sigmoid function was used as the output layer<br />
<br />
The CNN and CRNN architecture is also given in the charts below.<br />
<br />
[[File:CNN441.png|thumb|800px|none|Implementation of CNN Model]]<br />
[[File:CRNN441.png|thumb|800px|none|Implementation of CRNN Model]]<br />
<br />
=== Music Recommendation System ===<br />
<br />
The recommendation system is computed by the cosine similarity of the extraction features from the neural network. Each genre will have a song act as a centre point for each class. The final inputs of the trained neural networks will be the feature variables. The featured variables will be used in the cosine similarity to find the best recommendations. <br />
<br />
The values are between [-1,1], where larger values are songs that have similar features. When the user inputs five songs, those songs become the new inputs in the neural networks and the features are used by the cosine similarity with other music. The largest five cosine similarities are used as recommendations.<br />
[[File:Cosine441.png|frame|100px|none|Cosine Similarity]]<br />
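The recommendation step can be sketched as follows: compute the cosine similarity between the input song's feature vector and every candidate's, then return the most similar songs (plain Python; the song names and feature vectors are made-up stand-ins for the network's extracted features):<br />

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity in [-1, 1]; larger means more similar features."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def recommend(query, library, k=5):
    """Return the names of the k songs whose features are most similar to the query."""
    scored = sorted(library.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

library = {"song_a": [1.0, 0.0], "song_b": [0.9, 0.1],
           "song_c": [0.0, 1.0], "song_d": [0.5, 0.5]}
top = recommend([1.0, 0.1], library, k=2)
```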
<br />
== Evaluation Metrics ==<br />
=== Precision: ===<br />
* The proportion of True Positives with respect to the '''predicted''' positive cases (true positives and false positives)<br />
* For example, out of all the songs that the classifier '''predicted''' as Classical, how many are actually Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among those predicted to be of that certain genre<br />
<br />
=== Recall: ===<br />
* The proportion of True Positives with respect to the '''actual''' positive cases (true positives and false negatives)<br />
* For example, out of all the songs that are '''actually''' Classical, how many are correctly predicted to be Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among the correct instances of that genre<br />
<br />
=== F1-Score: ===<br />
An accuracy metric that combines the classifier’s precision and recall scores by taking the harmonic mean between the two metrics:<br />
<br />
[[File:F1441.png|frame|100px|none|F1-Score]]<br />
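All three metrics can be computed directly from confusion-matrix counts; here is a small plain-Python sketch with made-up counts:<br />

```python
def precision(tp, fp):
    """True positives over all predicted positives."""
    return tp / (tp + fp)

def recall(tp, fn):
    """True positives over all actual positives."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts: 80 Classical songs predicted correctly, 20 non-Classical
# songs mislabeled as Classical, and 40 Classical songs missed.
f1 = f1_score(tp=80, fp=20, fn=40)
```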
<br />
=== Receiver operating characteristics (ROC): ===<br />
* A graphical metric that is used to assess a classification model at different classification thresholds <br />
* In the case of a classification threshold of 0.5, this means that if <math>P(Y = k | X = x) > 0.5</math> then we classify this instance as class k<br />
* Plots the true positive rate versus false positive rate as the classification threshold is varied<br />
<br />
[[File:ROCGraph.jpg|thumb|400px|none|ROC Graph. Comparison of True Positive Rate and False Positive Rate]]<br />
<br />
=== Area Under the Curve (AUC) ===<br />
AUC is the area under the ROC curve; in doing so, it provides an aggregate measure of performance across all possible classification thresholds.<br />
<br />
In the context of the paper: When scoring all songs as <math>Prob(Classical | X=x)</math>, it is the probability that the model ranks a random Classical song at a higher probability than a random non-Classical song.<br />
<br />
[[File:AUCGraph.jpg|thumb|400px|none|Area under the ROC curve.]]<br />
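The probabilistic interpretation of AUC suggests a direct way to compute it: compare every positive-negative pair of scores (a plain-Python sketch with made-up scores):<br />

```python
def auc(pos_scores, neg_scores):
    """AUC = P(random positive is scored above random negative); ties count 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Perfectly separated scores give AUC = 1; overlapping scores lower it.
perfect = auc([0.9, 0.8], [0.2, 0.1])
partial = auc([0.9, 0.3], [0.4, 0.1])
```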
<br />
== Results ==<br />
=== Accuracy Metrics ===<br />
The table below is the accuracy metrics with the classification threshold of 0.5.<br />
<br />
[[File:TruePositiveChart.jpg|thumb|none|True Positive / False Positive Chart]]<br />
On average, CRNN outperforms CNN in true positive and false positive cases. In addition, it is very apparent that false positives are much more frequent for songs in the Instrumental genre, perhaps indicating that more pre-processing needs to be done for songs in this genre or that it should be excluded from analysis completely given how most music has instrumental components.<br />
<br />
<br />
[[File:F1Chart441.jpg|thumb|400px|none|F1 Chart]]<br />
On average, CRNN outperforms CNN in F1-score. <br />
<br />
<br />
[[File:AUCChart.jpg|thumb|400px|none|AUC Chart]]<br />
On average, CRNN also outperforms CNN in AUC metric.<br />
<br />
<br />
CRNN models, which consider the frequency features and time sequence patterns of songs, have better classification performance on metrics such as F1 score and AUC compared to the CNN classifier.<br />
<br />
=== Evaluation of Music Recommendation System: ===<br />
<br />
* A listening experiment was performed with 30 participants to assess user responses to given music recommendations.<br />
* Participants choose 5 pieces of music they enjoyed and the recommender system generated 5 new recommendations. The participants then evaluated the music recommendation by recording whether the song was liked or disliked.<br />
* The recommendation system takes two approaches to the recommendation:<br />
** Method one uses only the value of cosine similarity.<br />
** Method two uses the value of cosine similarity and information on music genre.<br />
*Perform test of significance of differences in average user likes between the two methods using a t-statistic:<br />
[[File:H0441.png|frame|100px|none|Hypothesis test between method 1 and method 2]]<br />
<br />
Comparing the two methods, <math> H_0: u_1 - u_2 = 0</math>, we have <math> t_{stat} = -4.743 < -2.037 </math>, which demonstrates that the increase in average user likes with the addition of music genre information is statistically significant.<br />
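The significance test above can be sketched as a pooled two-sample t-statistic in plain Python (the per-participant like counts below are made-up illustrative values, not the paper's data):<br />

```python
import math

def two_sample_t(a, b):
    """Pooled two-sample t-statistic for H0: mean(a) - mean(b) = 0."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical "number of liked songs out of 5" per participant for each method.
method1 = [2, 3, 2, 1, 3, 2]
method2 = [4, 3, 4, 5, 3, 4]
t = two_sample_t(method1, method2)
```

A large negative t, as in the paper's result, indicates method 2 (with genre information) receives significantly more likes.<br />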
<br />
== Conclusion: ==<br />
<br />
Here are two main conclusions obtained from this paper:<br />
<br />
- To increase the predictive capabilities of the music recommendation system, the music genre should be a key feature to analyze.<br />
<br />
- To extract the song genre from a song’s audio signals and get overall better performance, CRNN’s are superior to CNN’s as they consider frequency in features and time sequence patterns of audio signals. <br />
<br />
According to analyses in the paper, the authors also suggested adding other music features like tempo gram for capturing local tempo to improve the accuracy of the recommender system.<br />
<br />
== Critiques/ Insights: ==<br />
# The authors fail to give reference to the performance of current recommendation algorithms used in the industry; my critique would be for the authors to bench-mark their novel approach with other recommendation algorithms such as collaborative filtering to see if there is a lift in predictive capabilities.<br />
# The listening experiment used to evaluate the recommendation system only includes songs that are outputted by the model. Users may be biased if they believe all songs have come from a recommendation system. To remove bias, we suggest having 15 songs where 5 songs are recommended and 10 songs are set. With this in the user’s mind, it may remove some bias in response and give more accurate predictive capabilities. <br />
# They could go into more details about how CRNN makes it perform better than CNN, in terms of attributes of each network.<br />
# The methodology introduced in this paper is probably also suitable for movie recommendations. As music is presented as spectrograms (images) in a time sequence, and it is very similar to a movie. <br />
# The way of evaluation is a very interesting approach. Since it's usually not easy to evaluate the testing result when it's subjective. By listing all these evaluations' performance, the result would be more comprehensive. A practice that might reduce bias is by coming back to the participants after a couple of days and asking whether they liked the music that was recommended. Often times music "grows" on people and their opinion of a new song may change after some time has passed. <br />
# The paper lacks the comparison between the proposed algorithm and the music recommendation algorithms being used now. It will be clearer to show the superiority of this algorithm.<br />
# The GAN neural network has been proposed to enhance the performance of the neural network, so an improved result may appear after considering using GAN.<br />
# The limitation of CNN and CRNN could be that they are only able to process the spectrograms with single labels rather than multiple labels. This is far from enough for the music recommender systems in today's music industry since the edges between various genres are blurred.<br />
# Is it possible for CNN and CRNN to identify different songs? The model would be harder to train, based on my experience, the efficiency of CNN in R is not very high, which can be improved for future work.<br />
# According to the author, the recommender system works by calculating the cosine similarity of extracted features from one piece of music to another. Is it possible to represent this with Euclidean distance or other p-norm distances?<br />
# In real-life application, most of the music software will have the ability to recommend music to the listener and ask do they like the music that was recommended. It would be a nice application by involving some new information from the listener.<br />
# This paper is very similar to another [https://link.springer.com/chapter/10.1007/978-3-319-46131-1_29 paper], written by Bruce Fewerda and Markus Schedl. Both papers are suggesting methods of building music recommendation systems. However, this paper recommends music based on genre, but the paper written by Fewerda and Schedl suggests a personality-based user modeling for music recommender systems.<br />
# Actual music listeners do not listen to one genre of music, and in fact listening to the same track or the same genre would be somewhat unusual. Could this method be used to make recommendations not on genre, but based on other categories? (Such as the theme of the lyrics, the pitch of the singer, or the date published). Would this model be able to differentiate between tracks of varying "lyric vocabulation difficulty"? Or would NLP algorithms be needed to consider lyrics?<br />
# This model can be applied to many other fields, such as recommending news articles in a news app, products to buy on Amazon, or videos to watch on YouTube, based on user information.<br />
# It looks like CRNN outperforms CNN for most genres, but CNN did better on a few (like Jazz), so it might be better to mix them, or to use CNN for some genres and CRNN for the rest.<br />
# Cosine similarity is used to find songs with patterns similar to the user's input songs. That is, feature variables are extracted from the trained neural network model before the classification layer and used as the basis for finding similar songs. One potential problem with this approach is that if the neural network classifies an input song incorrectly, the extracted feature vector will not be a good representation of the input song. Thus, a song that is in fact very similar to the input song may have a small cosine similarity value, i.e. not be recommended. In short, if the first classification is wrong, inferences based on it will deviate further from the true answer. A possible future improvement would be finding a way to offset this inference error.<br />
# In the tables comparing the performance and accuracy of the CNN and CRNN models on different genres of music, the researchers claimed that CRNN had superior performance to the CNN models. This seems intuitive, especially in cases where the differences in accuracy were large. However, the researchers should consider including hypothesis-testing statistics in such tables, which would support such claims more rigorously.<br />
# A music recommender system that doesn't use a song's metadata, such as artist and genre, and instead tries to classify genre itself seems unproductive. The specific artist arguably matters much more than the genre, since within a genre there are many different styles. The authors seem to hamstring their recommender system by excluding other relevant data.<br />
# The genres posed in the paper are very broad and may not be specific enough to distinguish a listener's actual tastes (e.g., someone may like rock and roll but not punk rock, both of which could fall in the "rock" category). It would be interesting to run similar experiments with more concrete and specific genres to study the possibility of improving the model's accuracy.<br />
# This summary is well organized, with a detailed explanation of the music recommendation algorithm. However, since the data used in this paper was cleaned to improve the efficiency of the recommendation, there should be a section evaluating the impact of noise on this algorithm's performance and how to minimize that impact.<br />
# This method would work better if users chose certain music genres that they like during the sign-up process. This is similar to recommending articles on Twitter.<br />
# In the</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Efficient_kNN_Classification_with_Different_Numbers_of_Nearest_Neighbors&diff=49068Efficient kNN Classification with Different Numbers of Nearest Neighbors2020-12-04T03:43:04Z<p>Z42qin: /* Reconstruction */</p>
<hr />
<div>== Presented by == <br />
Cooper Brooke, Daniel Fagan, Maya Perelman<br />
<br />
== Introduction == <br />
Traditional model-based classification approaches first use training observations to fit a model before predicting test samples. In contrast, the model-free k-nearest neighbors (kNN) method classifies observations with a majority-rule approach, labeling each piece of test data based on its k closest training observations (neighbors). This method has become very popular due to its strong performance and simple implementation. <br />
<br />
There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples, and the second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.<br />
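The fixed-k majority-rule step can be sketched in a few lines of Python (toy data, illustrative only):<br />

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training samples."""
    # Rank training samples by squared Euclidean distance to x.
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data: two clusters.
X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
y = ["a", "a", "b", "b"]
print(knn_predict(X, y, (0.05, 0.1), k=3))  # a
```

Using a single k for every test sample, as above, is exactly what the varied-k methods discussed next improve upon.<br />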
<br />
== Previous Work == <br />
<br />
Previous work on finding an optimal fixed k value for all test samples is well-studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al. [2] explored involved selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] had the solution of selecting a robust k using the consensus of multiple rounds of kNNs. These fixed k methods are valuable; however, they are impractical for data mining and machine learning applications. <br />
<br />
Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahigara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross validation, while Sahigara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches such as those taken by Zheng et al. and Góra and Wojna also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.<br />
<br />
== Motivation == <br />
<br />
Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seeks to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.<br />
<br />
A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.<br />
<br />
== Approach == <br />
<br />
<br />
=== kTree Classification ===<br />
<br />
The proposed kTree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_1.png | center | 800x800px]]<br />
<br />
==== Reconstruction ====<br />
<br />
The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}\in \mathbb{R}^{d\times n} = [x_1,...,x_n]</math> represents the training set can be written as:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2<br />
\end{aligned}$$<br />
<br />
In addition, a regularization term multiplied by a tuning parameter, <math>\rho</math>, is added to control the complexity of <math>\textbf{W}</math>; the eventual goal is a sparse solution, so that each test sample depends on only a small number of training samples. Consider first an <math>l_2</math> penalty: <br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho||\textbf{W}||^2_2<br />
\end{aligned}$$<br />
<br />
This is called ridge regression, and it has a closed-form solution: $$W = (X^TX+\rho I)^{-1}X^TX$$<br />
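This closed form follows from setting the gradient of the ridge objective to zero, writing the sum over samples as a Frobenius norm:<br />
<br />
$$\begin{aligned}<br />
\frac{\partial}{\partial \textbf{W}}\left(||X\textbf{W} - X||_F^2 + \rho||\textbf{W}||_F^2\right) = 2X^TX\textbf{W} - 2X^TX + 2\rho \textbf{W} = 0 \implies \textbf{W} = (X^TX + \rho I)^{-1}X^TX<br />
\end{aligned}$$<br />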
<br />
However, this objective function does not produce a sparse result, so a sparse objective function is further employed: <br />
<br />
$$W = (X^TX+\rho I)^{-1}X^TX, \quad W \geq 0$$<br />
<br />
<br />
The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. It is penalized with the function: <br />
<br />
$$\frac{1}{2} \sum_{i,j}^{d} s_{ij} ||x^iW-x^jW||^2_2$$<br />
<br />
where <math>s_{ij}</math> denotes the relation between feature vectors <math>x^i</math> and <math>x^j</math>, calculated using a radial basis function kernel. After some transformations, this second regularization term, which has tuning parameter <math>\rho_2</math>, is:<br />
<br />
$$\begin{aligned}<br />
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{L} \textbf{X} \textbf{W})<br />
\end{aligned}$$<br />
<br />
where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features. The Laplacian matrix, also called the graph Laplacian, is a matrix representation of a graph. <br />
<br />
This gives a final objective function of:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||_1 + \rho_2R(\textbf{W})<br />
\end{aligned}$$<br />
<br />
Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.<br />
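As a sketch of the simplest step, the pure <math>l_2</math> (ridge) version above can be computed in closed form; the <math>l_1</math> and Laplacian terms of the full objective need an iterative solver and are omitted here. Toy NumPy example with an illustrative value of <math>\rho</math>:<br />

```python
import numpy as np

def reconstruct_weights(X, rho=0.1):
    """Closed-form ridge step: W = (X^T X + rho I)^{-1} X^T X.
    Columns of X (d x n) are the n training samples."""
    n = X.shape[1]
    G = X.T @ X                                   # n x n Gram matrix
    return np.linalg.solve(G + rho * np.eye(n), G)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                   # d = 5 features, n = 8 samples
W = reconstruct_weights(X, rho=0.1)
print(W.shape)                                    # (8, 8)
print(np.linalg.norm(X @ W - X))                  # small reconstruction residual
```

Note that this ridge solution is dense; the paper's full objective adds the sparsity and Laplacian penalties precisely so that most entries of <math>\mathbf{W^*}</math> become zero.<br />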
<br />
==== Calculate ''k'' for training set ====<br />
<br />
Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training samples. If a value is 0, the jth training sample has no effect on the ith training sample, which means it should not be used in the prediction of the ith training sample. Consequently, all non-zero values in the weight vector <math>w_i</math> are useful in predicting the ith training sample, so the number of non-zero elements for each sample equals that sample's optimal ''k'' value.<br />
<br />
For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:<br />
<br />
[[File:Approach_Figure_2.png | center | 300x300px]]<br />
<br />
The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 are non-zero.<br />
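In code, reading the optimal ''k'' values off a sparse <math>\textbf{W}^*</math> is just a matter of counting non-zero entries in each sample's weight vector. The matrix below is hypothetical and purely illustrative, with each sample's weights stored as a column:<br />

```python
import numpy as np

# Illustrative sparse W* for 4 training samples; column i holds the
# reconstruction weights of sample i (hypothetical values).
W_star = np.array([
    [0.0, 0.3, 0.1, 0.5],
    [0.4, 0.0, 0.2, 0.0],
    [0.0, 0.6, 0.0, 0.0],
    [0.7, 0.0, 0.3, 0.2],
])

# Optimal k for each sample = number of non-zero weights in its column.
optimal_k = np.count_nonzero(W_star, axis=0)
print(optimal_k)  # [2 2 3 2]
```

Sample 1, for example, has non-zero weights only for samples 2 and 4, so its optimal ''k'' is 2.<br />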
<br />
==== Train a Decision Tree using ''k'' as the label ====<br />
<br />
In a normal decision tree, the target data is the labels themselves. In contrast, in the kTree method, the target data is the optimal ''k'' value for each sample that was solved for in the previous step. So this decision tree has the following form:<br />
<br />
[[File:Approach_Figure_3.png | center | 300x300px]]<br />
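A minimal stand-in for this step, using a depth-1 decision stump instead of a full tree (the feature values and ''k'' labels below are hypothetical toy data):<br />

```python
import numpy as np

# The target is each training sample's optimal k (from W*), not its class.
train_X = np.array([[0.1], [0.3], [0.8], [0.9]])   # one feature for simplicity
optimal_k = np.array([2, 2, 3, 3])                 # per-sample k labels

def fit_stump(X, k_labels):
    """Pick the threshold on feature 0 that best splits the k labels."""
    best = None
    for t in X[:, 0]:
        left = k_labels[X[:, 0] <= t]
        right = k_labels[X[:, 0] > t]
        # error = number of samples disagreeing with their side's majority k
        err = sum(int(np.sum(side != np.bincount(side).argmax()))
                  for side in (left, right) if side.size)
        if best is None or err < best[0]:
            best = (err, t,
                    np.bincount(left).argmax() if left.size else None,
                    np.bincount(right).argmax() if right.size else None)
    return best[1:]

t, k_left, k_right = fit_stump(train_X, optimal_k)
predict_k = lambda x: k_left if x[0] <= t else k_right
print(predict_k(np.array([0.2])), predict_k(np.array([0.85])))  # 2 3
```

The paper uses a full decision tree rather than a single stump, but the idea is the same: test-sample features in, predicted optimal ''k'' out.<br />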
<br />
==== Making Predictions for Test Data ====<br />
<br />
The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbors across '''all''' of the training data.<br />
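This final step can be sketched as follows, with each test sample using its own ''k''; the <code>k_values</code> list stands in for the kTree's output, and the data is hypothetical:<br />

```python
import numpy as np
from collections import Counter

train_X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
train_y = np.array(["a", "a", "b", "b", "b"])
test_X = np.array([[0.1, 0.0], [1.0, 1.1]])
k_values = [2, 3]  # assumed per-sample optimal k read off the trained kTree

labels = []
for x, k in zip(test_X, k_values):
    # Majority vote among this sample's own k nearest training samples.
    nearest = np.argsort(np.linalg.norm(train_X - x, axis=1))[:k]
    labels.append(str(Counter(train_y[nearest]).most_common(1)[0][0]))
print(labels)  # ['a', 'b']
```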
<br />
=== k*Tree Classification ===<br />
<br />
The proposed k*Tree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_4.png | center | 1000x1000px]]<br />
<br />
Clearly, this is a very similar approach to the kTree as the k*Tree method attempts to sacrifice very little in predictive power in return for a substantial decrease in complexity when actually implementing the traditional kNN on the testing data once the optimal ''k'' values have been found.<br />
<br />
While all of the previous steps are exactly the same, the difference comes from the additional data stored in the leaf nodes. The k*Tree method stores not only the optimal ''k'' value but also the following information:<br />
<br />
* The training samples that have the same optimal ''k''<br />
* The ''k'' nearest neighbours of the previously identified training samples<br />
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours<br />
<br />
The data stored in each node is summarized in the following figure:<br />
<br />
[[File:Approach_Figure_5.png | center | 800x800px]]<br />
<br />
When testing, the constructed k*Tree is searched for the test sample's optimal ''k'' value as well as its nearest neighbours in the leaf node. It then selects a number of nearest neighbours from the subset of training samples and assigns the test sample the majority label of these nearest neighbours.<br />
<br />
In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.<br />
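A rough sketch of what one k*Tree leaf might store, following the three bullets above (all indices are hypothetical):<br />

```python
# Hypothetical contents of a single k*Tree leaf node (illustrative indices).
leaf = {
    "optimal_k": 3,
    "samples_with_this_k": [0, 4, 7],                         # share optimal k = 3
    "their_knn": {0: [1, 2, 3], 4: [5, 6, 2], 7: [6, 1, 5]},  # k nearest neighbours
    "nn_of_neighbours": {1: 0, 2: 3, 3: 2, 5: 6, 6: 5},       # each neighbour's 1-NN
}
# At test time only these stored samples are scanned instead of the full
# training set, which is where the speedup over kTree comes from.
print(leaf["optimal_k"])  # 3
```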
<br />
== Experiments == <br />
<br />
In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, and contained a mix of data types varying in size, in dimensionality, in the number of classes, and in imbalanced nature of the data. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:<br />
<br />
# k-Nearest Neighbor: The classical kNN approach with k set to 1, 5, 10, 20, and the square root of the sample size [9]; the best result was reported.<br />
# kNN-Based Applicability Domain Approach (AD-kNN) [11]<br />
# kNN Method Based on Sparse Learning (S-kNN) [10]<br />
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]<br />
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]<br />
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]<br />
<br />
The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features. <br />
<br />
<br />
'''A. Experimental Results on Different Sample Sizes'''<br />
<br />
The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.<br />
<br />
[[File:Table_I_kNN.png | center | 1000x1000px]]<br />
<br />
The following key results are noted:<br />
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-kNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.<br />
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.<br />
* The proposed k*Tree method had the lowest running cost of all methods. However, the k*Tree method was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15,000 times faster than either method. In addition, the kTree had the highest accuracy, and its running cost was lower than that of any other method except the k*Tree method.<br />
<br />
<br />
'''B. Experimental Results on Different Feature Numbers'''<br />
<br />
The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score, an algorithm that solves maximum likelihood equations numerically [15], was used to rank and select the most informative features in the datasets. <br />
<br />
[[File:Table_II_kNN.png | center | 1000x1000px]]<br />
<br />
From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-kNN when tested with varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy but were greatly outperformed in running cost by k*Tree. The cause of this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.<br />
<br />
== Conclusion == <br />
<br />
This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods can classify the test samples efficiently and effectively, by designing a training step that reduces the run time of the test stage and thus enhances the performance. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features.<br />
<br />
== Critiques == <br />
<br />
*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy. <br />
* The authors addressed that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.). <br />
*While the authors contrast their kTree and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compare to Góra and Wojna's varied optimal k method.<br />
<br />
* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.<br />
<br />
* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.<br />
<br />
* It would be really helpful if the kTree method were explained at the very beginning. The transition from kNN to kTree is not very smooth.<br />
<br />
* It would be nice to have a comparison of the running costs of different methods to see how much faster kTree and k*Tree performed.<br />
<br />
* It would be better to show only the key results in a summary rather than stacking up all results without screening.<br />
<br />
* In the results section, it was mentioned that in the experiment on data sets with different numbers of features, the kTree and k*Tree model did not achieve GS-kNN or S-kNN's accuracies, but was faster in terms of running cost. It might be helpful here if the authors add some more supporting arguments about the benefit of this tradeoff, which appears to be a minor decrease in accuracy for a large improvement in speed. This could further showcase the advantages of the kTree and k*Tree models. More quantitative analysis or real-life scenario examples could be some choices here.<br />
<br />
* An interesting thing to notice while solving for the optimal matrix <math>W^*</math> that minimizes the loss function is that <math>W^*</math> is not necessarily a symmetric matrix. That is, the correlation between the <math>i^{th}</math> entry and the <math>j^{th}</math> entry is different from that between the <math>j^{th}</math> entry and the <math>i^{th}</math> entry, which makes the resulting <math>W^*</math> not really semantically meaningful. Therefore, it would be interesting to set a threshold on the allowed difference between the <math>ij^{th}</math> entry and the <math>ji^{th}</math> entry in <math>W^*</math> and see whether this new configuration gives better or worse results compared to the current ones, which would provide better insight into the algorithm.<br />
<br />
* It would be interesting to see how the proposed model works with highly non-linear datasets. In the event it does not work well, it would pose the question: would replacing the k*Tree with a SVM or a neural network improve the accuracy? There could be experiments to show if this variant would prove superior over the original models.<br />
<br />
* The key results are a little misleading. For example, the claim that kTree had the highest accuracy and a running cost lower than every method except k*Tree is false: the kTree method had slightly lower accuracy than both GS-kNN and S-kNN, and kTree was also slower than LC-kNN.<br />
<br />
== References == <br />
<br />
[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.<br />
<br />
[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.<br />
<br />
[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.<br />
<br />
[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.<br />
<br />
[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.<br />
<br />
[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.<br />
<br />
[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.<br />
<br />
[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.<br />
<br />
[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.<br />
<br />
[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.<br />
<br />
[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013. <br />
<br />
[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.<br />
<br />
[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.<br />
<br />
[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.<br />
<br />
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Efficient_kNN_Classification_with_Different_Numbers_of_Nearest_Neighbors&diff=49067Efficient kNN Classification with Different Numbers of Nearest Neighbors2020-12-04T03:42:39Z<p>Z42qin: /* kTree Classification */</p>
<hr />
<div>== Presented by == <br />
Cooper Brooke, Daniel Fagan, Maya Perelman<br />
<br />
== Introduction == <br />
Traditional model-based classification approaches first use training observations to fit a model before predicting test samples. In contrast, the model-free k-nearest neighbors (kNN) method classifies observations with a majority-rule approach, labeling each piece of test data based on its k closest training observations (neighbors). This method has become very popular due to its strong performance and simple implementation. <br />
<br />
There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples, and the second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.<br />
<br />
== Previous Work == <br />
<br />
Previous work on finding an optimal fixed k value for all test samples is well-studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al. [2] explored involved selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] had the solution of selecting a robust k using the consensus of multiple rounds of kNNs. These fixed k methods are valuable; however, they are impractical for data mining and machine learning applications. <br />
<br />
Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahigara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross validation, while Sahigara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches such as those taken by Zheng et al. and Góra and Wojna also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.<br />
<br />
== Motivation == <br />
<br />
Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seeks to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.<br />
<br />
A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.<br />
<br />
== Approach == <br />
<br />
<br />
=== kTree Classification ===<br />
<br />
The proposed kTree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_1.png | center | 800x800px]]<br />
<br />
==== Reconstruction ====<br />
<br />
The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}\in \mathbb{R}^{d\times n} = [x_1,...,x_n]</math> represents the training set can be written as:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2<br />
\end{aligned}$$<br />
<br />
In addition, a regularization term multiplied by a tuning parameter, <math>\rho</math>, is added to control the complexity of <math>\textbf{W}</math>; the eventual goal is a sparse solution, so that each test sample depends on only a small number of training samples. Consider first an <math>l_2</math> penalty: <br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho||\textbf{W}||^2_2<br />
\end{aligned}$$<br />
<br />
This is called ridge regression, and it has a closed-form solution: $$W = (X^TX+\rho I)^{-1}X^TX$$<br />
<br />
However, this objective function does not produce a sparse result, so a sparse objective function is further employed: <br />
<br />
$$W = (X^TX+\rho I)^{-1}X^TX, \quad W \geq 0$$<br />
<br />
<br />
The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. It is penalized with the function: <br />
<br />
$$\frac{1}{2} \sum_{i,j}^{d} s_{ij} ||x^iW-x^jW||^2_2$$<br />
<br />
where <math>s_{ij}</math> denotes the relation between feature vectors <math>x^i</math> and <math>x^j</math>, calculated using a radial basis function kernel. After some transformations, this second regularization term, which has tuning parameter <math>\rho_2</math>, is:<br />
<br />
$$\begin{aligned}<br />
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{L} \textbf{X} \textbf{W})<br />
\end{aligned}$$<br />
<br />
where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features. The Laplacian matrix, also called the graph Laplacian, is a matrix representation of a graph. <br />
<br />
This gives a final objective function of:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||_1 + \rho_2R(\textbf{W})<br />
\end{aligned}$$<br />
<br />
Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.<br />
<br />
==== Calculate ''k'' for training set ====<br />
<br />
Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training samples. If a value is 0, the jth training sample has no effect on the ith training sample, which means it should not be used in the prediction of the ith training sample. Consequently, all non-zero values in the weight vector <math>w_i</math> are useful in predicting the ith training sample, so the number of non-zero elements for each sample equals that sample's optimal ''k'' value.<br />
<br />
For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:<br />
<br />
[[File:Approach_Figure_2.png | center | 300x300px]]<br />
<br />
The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 are non-zero.<br />
<br />
==== Train a Decision Tree using ''k'' as the label ====<br />
<br />
In a normal decision tree, the target data is the labels themselves. In contrast, in the kTree method, the target data is the optimal ''k'' value for each sample that was solved for in the previous step. So this decision tree has the following form:<br />
<br />
[[File:Approach_Figure_3.png | center | 300x300px]]<br />
<br />
==== Making Predictions for Test Data ====<br />
<br />
The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbors across '''all''' of the training data.<br />
<br />
=== k*Tree Classification ===<br />
<br />
The proposed k*Tree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_4.png | center | 1000x1000px]]<br />
<br />
Clearly, this is a very similar approach to the kTree as the k*Tree method attempts to sacrifice very little in predictive power in return for a substantial decrease in complexity when actually implementing the traditional kNN on the testing data once the optimal ''k'' values have been found.<br />
<br />
While all of the previous steps are exactly the same, the difference comes from the additional data stored in the leaf nodes. The k*Tree method stores not only the optimal ''k'' value but also the following information:<br />
<br />
* The training samples that have the same optimal ''k''<br />
* The ''k'' nearest neighbours of the previously identified training samples<br />
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours<br />
<br />
The data stored in each node is summarized in the following figure:<br />
<br />
[[File:Approach_Figure_5.png | center | 800x800px]]<br />
<br />
When testing, the constructed k*Tree is searched for the test sample's optimal ''k'' value as well as its nearest neighbours in the leaf node. It then selects a number of nearest neighbours from the subset of training samples and assigns the test sample the majority label of these nearest neighbours.<br />
<br />
In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.<br />
<br />
== Experiments == <br />
<br />
In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, and contained a mix of data types varying in size, in dimensionality, in the number of classes, and in imbalanced nature of the data. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:<br />
<br />
# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.<br />
# kNN-Based Applicability Domain Approach (AD-kNN) [11]<br />
# kNN Method Based on Sparse Learning (S-kNN) [10]<br />
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]<br />
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]<br />
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]<br />
<br />
The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features. <br />
<br />
<br />
'''A. Experimental Results on Different Sample Sizes'''<br />
<br />
The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.<br />
<br />
[[File:Table_I_kNN.png | center | 1000x1000px]]<br />
<br />
The following key results are noted:<br />
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-KNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.<br />
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.<br />
* The proposed k*Tree method had the lowest running cost of all methods. However, the k*Tree method was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method.<br />
<br />
<br />
'''B. Experimental Results on Different Feature Numbers'''<br />
<br />
The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score, an algorithm that solves maximum likelihood equations numerically [15], was used to rank and select the most information features in the datasets. <br />
<br />
[[File:Table_II_kNN.png | center | 1000x1000px]]<br />
<br />
From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-KNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.<br />
<br />
== Conclusion == <br />
<br />
This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods can classify the test samples efficiently and effectively, by designing a training step that reduces the run time of the test stage and thus enhances the performance. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features.<br />
<br />
== Critiques == <br />
<br />
*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy. <br />
* The authors addressed that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.). <br />
*While the authors contrast their kTree and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compared to Góra and Wojna varied optimal k method.<br />
<br />
* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.<br />
<br />
* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.<br />
<br />
* It would be really helpful if kTrees method can be explained at the very beginning. The transition from KNN to kTrees is not very smooth.<br />
<br />
* It would be nice to have a comparison of the running costs of different methods to see how much faster kTree and k*Tree performed<br />
<br />
* It would be better to show the key result only on a summary rather than stacking up all results without screening.<br />
<br />
* In the results section, it was mentioned that in the experiment on data sets with different numbers of features, the kTree and k*Tree model did not achieve GS-kNN or S-kNN's accuracies, but was faster in terms of running cost. It might be helpful here if the authors add some more supporting arguments about the benefit of this tradeoff, which appears to be a minor decrease in accuracy for a large improvement in speed. This could further showcase the advantages of the kTree and k*Tree models. More quantitative analysis or real-life scenario examples could be some choices here.<br />
<br />
* An interesting thing to notice while solving for the optimal matrix <math>W^*</math> that minimizes the loss function is that <math>W^*</math> is not necessarily a symmetric matrix. That is, the correlation between the <math>i^{th}</math> entry and the <math>j^{th}</math> entry is different from that between the <math>j^{th}</math> entry and the <math>i^{th}</math> entry, which makes the resulting W* not really semantically meaningful. Therefore, it would be interesting if we may set a threshold on the allowing difference between the <math>ij^{th}</math> entry and the <math>ji^{th}</math> entry in <math>W^*</math> and see if this new configuration will give better or worse results compared to current ones, which will provide better insights of the algorithm.<br />
<br />
* It would be interesting to see how the proposed model works with highly non-linear datasets. In the event it does not work well, it would pose the question: would replacing the k*Tree with a SVM or a neural network improve the accuracy? There could be experiments to show if this variant would prove superior over the original models.<br />
<br />
* The key results are a little misleading - for example they claim "the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method" is false. The kTree method had slightly lower accuracy than both GS-kNN and S-kNN and kTree was also slower than LC-kNN<br />
<br />
== References == <br />
<br />
[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.<br />
<br />
[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.<br />
<br />
[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.<br />
<br />
[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.<br />
<br />
[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.<br />
<br />
[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.<br />
<br />
[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.<br />
<br />
[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.<br />
<br />
[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.<br />
<br />
[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.<br />
<br />
[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013. <br />
<br />
[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.<br />
<br />
[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.<br />
<br />
[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.<br />
<br />
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Efficient_kNN_Classification_with_Different_Numbers_of_Nearest_Neighbors&diff=49066Efficient kNN Classification with Different Numbers of Nearest Neighbors2020-12-04T03:42:14Z<p>Z42qin: /* Reconstruction */</p>
<hr />
<div>== Presented by == <br />
Cooper Brooke, Daniel Fagan, Maya Perelman<br />
<br />
== Introduction == <br />
Traditional model-based classification approaches first use training observations to fit a model before predicting test samples. In contrast, the model-free k-nearest neighbors (kNN) method classifies observations with a majority-rule approach, labeling each test sample based on its k closest training observations (neighbors). This method has become very popular due to its strong performance and simple implementation. <br />
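As a concrete illustration of the majority rule described above, a minimal fixed-k kNN classifier can be sketched in a few lines of NumPy (a toy sketch, not the paper's implementation; the data below is invented):<br />

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k):
    """Classify one test point by majority vote among its k nearest training samples."""
    # Euclidean distance from the test point to every training sample (rows of X_train)
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Majority label among those neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0
```

Note that k is fixed here for all test points; the methods in this paper are about choosing a different k per test sample.<br />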
<br />
There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples, and the second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.<br />
<br />
== Previous Work == <br />
<br />
Previous work on finding an optimal fixed k value for all test samples is well-studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al. [2] explored involved selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] had the solution of selecting a robust k using the consensus of multiple rounds of kNNs. These fixed-k methods are valuable; however, they are impractical for data mining and machine learning applications. <br />
<br />
Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahigara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross-validation, while Sahigara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches, such as those taken by Zhang et al. and Góra and Wojna, also show promise. Zhang et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.<br />
<br />
== Motivation == <br />
<br />
Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approaches seek to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.<br />
<br />
A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.<br />
<br />
== Approach == <br />
<br />
<br />
=== kTree Classification ===<br />
<br />
The proposed kTree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_1.png | center | 800x800px]]<br />
<br />
==== Reconstruction ====<br />
<br />
The first step is to use the training samples to reconstruct themselves. The goal is to find the matrix of correlations between the training samples, <math>\textbf{W} \in \mathbb{R}^{n \times n}</math>, such that the distance between each individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function, where <math>\mathbf{X}\in \mathbb{R}^{d\times n} = [x_1,...,x_n]</math> represents the training set, can be written as:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2<br />
\end{aligned}$$<br />
<br />
To ensure that sparse results are generated, a regularization term multiplied by a tuning parameter, <math>\rho_1</math>, is added, since the objective is to minimize the number of training samples that each test sample will eventually depend on. With an <math>l_2</math> penalty, the problem becomes ridge regression:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||^2_2<br />
\end{aligned}$$<br />
<br />
which has the closed-form solution<br />
<br />
$$\textbf{W} = (X^TX+\rho_1 I)^{-1}X^TX$$<br />
<br />
However, ridge regression does not produce a sparse result, so a sparsity-inducing <math>l_1</math> penalty with a non-negativity constraint is employed instead:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||_1, \quad \textbf{W} \geq 0<br />
\end{aligned}$$<br />
<br />
<br />
The least square loss function is then further modified to account for the fact that samples with similar values for certain features should yield similar results. This is encouraged with the penalty term:<br />
<br />
$$\begin{aligned}<br />
\frac{1}{2} \sum_{i,j=1}^{d} s_{ij} ||x^i\textbf{W}-x^j\textbf{W}||^2_2<br />
\end{aligned}$$<br />
<br />
where <math>x^i</math> denotes the <math>i</math>th feature vector (row) of <math>\textbf{X}</math> and <math>s_{ij}</math> denotes the similarity between the <math>i</math>th and <math>j</math>th features, computed with a radial basis function kernel. After some transformations, this second regularization term, with tuning parameter <math>\rho_2</math>, becomes:<br />
<br />
$$\begin{aligned}<br />
R(\textbf{W}) = Tr(\textbf{W}^T \textbf{X}^T \textbf{L} \textbf{X} \textbf{W})<br />
\end{aligned}$$<br />
<br />
where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features. The Laplacian matrix, also called the graph Laplacian, is a matrix representation of a graph. <br />
<br />
This gives a final objective function of:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||_1 + \rho_2R(\textbf{W})<br />
\end{aligned}$$<br />
<br />
Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.<br />
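The reconstruction step can be sketched as follows. For simplicity, this sketch uses only the ridge (<math>l_2</math>) closed form and then zeroes negative and near-zero entries to imitate the sparsity that the <math>l_1</math> penalty and non-negativity constraint would enforce; the paper's actual solver optimizes the full objective (including the Laplacian term) iteratively:<br />

```python
import numpy as np

def reconstruct_W(X, rho=0.1, tol=1e-3):
    """Approximate the correlation matrix W so that X @ W reconstructs X column-wise.

    X has shape (d, n): d features, n training samples stored as columns.
    Uses the ridge closed form W = (X^T X + rho*I)^{-1} X^T X; the paper's
    l1 sparsity and W >= 0 constraint are only imitated here by zeroing
    negative and near-zero entries.
    """
    n = X.shape[1]
    W = np.linalg.solve(X.T @ X + rho * np.eye(n), X.T @ X)
    W[W < tol] = 0.0  # crude stand-in for the l1 penalty and non-negativity
    return W

# Each column i of W then lists which training samples help reconstruct sample i.
```

This is a sketch under stated assumptions, not a faithful solver: the thresholding step is a simplification, and the Laplacian regularizer <math>R(\textbf{W})</math> is omitted entirely.<br />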
<br />
==== Calculate ''k'' for training set ====<br />
<br />
Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the <math>i</math>th and <math>j</math>th training samples, so if a value is 0, it can be concluded that the <math>j</math>th training sample has no effect on the <math>i</math>th training sample, which means it should not be used in the prediction of the <math>i</math>th training sample. Consequently, all training samples with non-zero entries in <math>w_i</math> (the <math>i</math>th column of <math>\textbf{W*}</math>) are useful in predicting the <math>i</math>th training sample, so the number of non-zero elements in that column equals the optimal ''k'' value for that sample.<br />
<br />
For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:<br />
<br />
[[File:Approach_Figure_2.png | center | 300x300px]]<br />
<br />
The optimal ''k'' value for training sample 1 would be 2, since the correlations between training sample 1 and training samples 2 and 4 are non-zero.<br />
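Under this interpretation, the optimal ''k'' for each training sample is simply the number of non-zero entries in its column of <math>\textbf{W*}</math>. A small sketch with a hypothetical 4x4 matrix, chosen to match the example above:<br />

```python
import numpy as np

# Hypothetical 4x4 optimal correlation matrix W*; entry (j, i) links training
# sample j to the reconstruction of training sample i.
W_star = np.array([
    [0.0, 0.3, 0.0, 0.2],
    [0.5, 0.0, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.6],
    [0.1, 0.0, 0.2, 0.0],
])

# Optimal k for sample i = number of non-zero entries in column i of W*.
optimal_k = np.count_nonzero(W_star, axis=0)
print(optimal_k)  # -> [2 1 2 2]
```

Here sample 1 (first column) gets k = 2 because only samples 2 and 4 have non-zero correlations with it.<br />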
<br />
==== Train a Decision Tree using ''k'' as the label ====<br />
<br />
In a normal decision tree, the target data is the labels themselves. In contrast, in the kTree method, the target data is the optimal ''k'' value for each sample that was solved for in the previous step. So this decision tree has the following form:<br />
<br />
[[File:Approach_Figure_3.png | center | 300x300px]]<br />
<br />
==== Making Predictions for Test Data ====<br />
<br />
The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbors across '''all''' of the training data.<br />
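The two steps above, training a tree on the learned ''k'' values and then classifying each test sample with its predicted ''k'', can be sketched with scikit-learn's generic decision tree (an assumption for illustration; the paper constructs its own tree, and <code>k_train</code> is assumed to come from the non-zero counts of <math>\textbf{W*}</math>):<br />

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def majority_vote_knn(X_train, y_train, x, k):
    """Majority label among the k training samples nearest to x."""
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def ktree_fit_predict(X_train, y_train, k_train, X_test):
    """kTree sketch: a decision tree whose targets are per-sample optimal k values."""
    tree = DecisionTreeClassifier().fit(X_train, k_train)  # targets are k values, not class labels
    k_test = tree.predict(X_test)                          # optimal k per test sample
    return np.array([majority_vote_knn(X_train, y_train, x, int(k))
                     for x, k in zip(X_test, k_test)])
```

At test time, finding ''k'' costs only a tree traversal, but the vote still scans all training samples; avoiding that scan is exactly what k*Tree adds.<br />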
<br />
=== k*Tree Classification ===<br />
<br />
The proposed k*Tree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_4.png | center | 1000x1000px]]<br />
<br />
Clearly, this is a very similar approach to kTree: the k*Tree method sacrifices very little predictive power in return for a substantial decrease in the complexity of running the traditional kNN step on the test data once the optimal ''k'' values have been found.<br />
<br />
While all previous steps are exactly the same, the difference comes from the additional data stored in the leaf nodes. The k*Tree method stores not only the optimal ''k'' value but also the following information:<br />
<br />
* The training samples that have the same optimal ''k''<br />
* The ''k'' nearest neighbours of the previously identified training samples<br />
* The nearest neighbour of each of the previously identified ''k'' nearest neighbours<br />
<br />
The data stored in each node is summarized in the following figure:<br />
<br />
[[File:Approach_Figure_5.png | center | 800x800px]]<br />
<br />
When testing, the constructed k*Tree is searched for the test sample's optimal ''k'' value as well as its nearest neighbours in the leaf node. It then selects a number of nearest neighbours from the stored subset of training samples and assigns the test sample the majority label of these nearest neighbours.<br />
<br />
In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.<br />
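A leaf of the k*Tree can thus be viewed as a small record, and test-time prediction as a lookup followed by a vote over only the stored subset. The following is a schematic sketch; the field names and data are invented for illustration:<br />

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Leaf:
    k: int                    # optimal k stored in this leaf
    sample_ids: np.ndarray    # training samples sharing this optimal k
    neighbor_ids: np.ndarray  # their stored nearest neighbours, extending the search subset

def kstar_tree_predict(leaf, X_train, y_train, x_test):
    """Vote among leaf.k nearest neighbours drawn only from the leaf's stored subset."""
    subset = np.union1d(leaf.sample_ids, leaf.neighbor_ids)
    dists = np.linalg.norm(X_train[subset] - x_test, axis=1)
    nearest = subset[np.argsort(dists)[:leaf.k]]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

The speedup comes from <code>subset</code> being far smaller than the full training set, which is the contrast with the kTree prediction step described above.<br />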
<br />
== Experiments == <br />
<br />
In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, which contained a mix of data types varying in size, dimensionality, number of classes, and degree of class imbalance. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:<br />
<br />
# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.<br />
# kNN-Based Applicability Domain Approach (AD-kNN) [11]<br />
# kNN Method Based on Sparse Learning (S-kNN) [10]<br />
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]<br />
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]<br />
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]<br />
<br />
The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features. <br />
<br />
<br />
'''A. Experimental Results on Different Sample Sizes'''<br />
<br />
The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.<br />
<br />
[[File:Table_I_kNN.png | center | 1000x1000px]]<br />
<br />
The following key results are noted:<br />
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-kNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.<br />
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.<br />
* The proposed k*Tree method had the lowest running cost of all methods. However, the k*Tree method was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy, and its running cost was lower than that of any other method except the k*Tree method.<br />
<br />
<br />
'''B. Experimental Results on Different Feature Numbers'''<br />
<br />
The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score, an algorithm that solves maximum likelihood equations numerically [15], was used to rank and select the most informative features in the datasets. <br />
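For intuition, one standard form of the Fisher score ranks feature <math>j</math> by the ratio of between-class to within-class variance; the sketch below uses this common feature-selection variant, which may differ in detail from the method of [15]:<br />

```python
import numpy as np

def fisher_score(X, y):
    """Fisher score per feature: sum_c n_c*(mu_cj - mu_j)^2 / sum_c n_c*var_cj."""
    scores = np.zeros(X.shape[1])
    mu = X.mean(axis=0)  # overall mean of each feature
    for j in range(X.shape[1]):
        num = den = 0.0
        for c in np.unique(y):
            Xc = X[y == c, j]  # values of feature j within class c
            num += len(Xc) * (Xc.mean() - mu[j]) ** 2
            den += len(Xc) * Xc.var()
        scores[j] = num / den if den > 0 else 0.0
    return scores

# Higher score = more discriminative; rank features by descending score:
# ranking = np.argsort(fisher_score(X, y))[::-1]
```

A feature that separates the classes well gets a large score, while a feature whose values are mixed across classes scores near zero.<br />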
<br />
[[File:Table_II_kNN.png | center | 1000x1000px]]<br />
<br />
From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-KNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.<br />
<br />
== Conclusion == <br />
<br />
This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods can classify the test samples efficiently and effectively, by designing a training step that reduces the run time of the test stage and thus enhances the performance. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features.<br />
<br />
== Critiques == <br />
<br />
*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy. <br />
* The authors addressed that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.). <br />
*While the authors contrast their kTree and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compare to Góra and Wojna's varied optimal k method.<br />
<br />
* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN, FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.<br />
<br />
* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.<br />
<br />
* It would be really helpful if the kTree method were explained at the very beginning. The transition from kNN to kTree is not very smooth.<br />
<br />
* It would be nice to have a comparison of the running costs of the different methods to see how much faster kTree and k*Tree performed.<br />
<br />
* It would be better to show only the key results in a summary rather than stacking up all results without screening.<br />
<br />
* In the results section, it was mentioned that in the experiment on data sets with different numbers of features, the kTree and k*Tree model did not achieve GS-kNN or S-kNN's accuracies, but was faster in terms of running cost. It might be helpful here if the authors add some more supporting arguments about the benefit of this tradeoff, which appears to be a minor decrease in accuracy for a large improvement in speed. This could further showcase the advantages of the kTree and k*Tree models. More quantitative analysis or real-life scenario examples could be some choices here.<br />
<br />
* An interesting thing to notice while solving for the optimal matrix <math>W^*</math> that minimizes the loss function is that <math>W^*</math> is not necessarily a symmetric matrix. That is, the correlation between the <math>i^{th}</math> entry and the <math>j^{th}</math> entry differs from that between the <math>j^{th}</math> entry and the <math>i^{th}</math> entry, which makes the resulting <math>W^*</math> not truly semantically meaningful. Therefore, it would be interesting to set a threshold on the allowed difference between the <math>ij^{th}</math> entry and the <math>ji^{th}</math> entry of <math>W^*</math> and see whether this new configuration gives better or worse results than the current ones, which would provide better insight into the algorithm.<br />
<br />
* It would be interesting to see how the proposed model works with highly non-linear datasets. In the event it does not work well, it would pose the question: would replacing the k*Tree with a SVM or a neural network improve the accuracy? There could be experiments to show if this variant would prove superior over the original models.<br />
<br />
* The key results are a little misleading - for example, the claim that "the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method" is false. The kTree method had slightly lower accuracy than both GS-kNN and S-kNN, and kTree was also slower than LC-kNN.<br />
<br />
== References == <br />
<br />
[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.<br />
<br />
[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.<br />
<br />
[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.<br />
<br />
[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.<br />
<br />
[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.<br />
<br />
[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.<br />
<br />
[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.<br />
<br />
[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.<br />
<br />
[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.<br />
<br />
[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.<br />
<br />
[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013. <br />
<br />
[12] Z. H. Zhou and Y. Yu, “Ensembling local learners through multimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.<br />
<br />
[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.<br />
<br />
[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.<br />
<br />
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.</div>
<hr />
<div>== Presented by == <br />
Cooper Brooke, Daniel Fagan, Maya Perelman<br />
<br />
== Introduction == <br />
Traditional model-based classification approaches first use training observations to fit a model before predicting test samples. In contrast, the model-free k-nearest neighbors (KNNs) method classifies observations with a majority rule approach, labeling each piece of test data based on its k closest training observations (neighbors). This method has become very popular due to its strong performance and simple implementation. <br />
<br />
There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples, and the second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.<br />
<br />
== Previous Work == <br />
<br />
Previous work on finding an optimal fixed k value for all test samples is well-studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al. [2] explored involved selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] proposed selecting a robust k using the consensus of multiple rounds of kNNs. These fixed-k methods are valuable; however, they are impractical for data mining and machine learning applications. <br />
<br />
Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahigara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross validation, while Sahigara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches, such as those taken by Zheng et al. and Góra and Wojna, also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.<br />
<br />
== Motivation == <br />
<br />
Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seeks to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.<br />
<br />
A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.<br />
<br />
== Approach == <br />
<br />
<br />
=== kTree Classification ===<br />
<br />
The proposed kTree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_1.png | center | 800x800px]]<br />
<br />
==== Reconstruction ====<br />
<br />
The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}\in \mathbb{R}^{d\times n} = [x_1,...,x_n]</math> represents the training set can be written as:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2<br />
\end{aligned}$$<br />
<br />
In addition, a regularization term multiplied by a tuning parameter, <math>\rho</math>, is added to control the complexity of <math>\textbf{W}</math>; the eventual objective is to minimize the number of training samples that the test samples will depend on. Starting with an <math>l_2</math> penalty: <br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho||\textbf{W}||^2_2<br />
\end{aligned}$$<br />
<br />
This is called ridge regression, and it has the closed-form solution $W = (X^TX+\rho I)^{-1}X^TX$.<br />
<br />
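The closed-form ridge solution can be computed directly; here is a NumPy sketch, with <math>X</math> a d×n matrix whose columns are the training samples as in the notation above (an illustration, not the authors' code):

```python
import numpy as np

def ridge_reconstruction(X, rho):
    """Solve min_W ||XW - X||_F^2 + rho*||W||_F^2 in closed form:
    W = (X^T X + rho*I)^(-1) X^T X, with X of shape d x n (samples as columns)."""
    n = X.shape[1]
    G = X.T @ X                                     # n x n Gram matrix of the samples
    return np.linalg.solve(G + rho * np.eye(n), G)  # avoids forming an explicit inverse
```

Using `np.linalg.solve` on the linear system is both faster and numerically safer than computing the inverse explicitly.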
However, this objective function does not provide a sparse result, so a sparse objective function is employed instead, replacing the <math>l_2</math> penalty with an <math>l_1</math> penalty (with tuning parameter <math>\rho_1</math>) and a non-negativity constraint: <br />
<br />
$\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||_1, \quad \textbf{W} \geq 0$<br />
<br />
<br />
The least square loss function is then further modified to account for the fact that samples with similar values for certain features should yield similar results. This is penalized with the function: <br />
<br />
$\frac{1}{2} \sum_{i,j}^{d} s_{ij} ||x^iW - x^jW||^2_2$<br />
<br />
where <math>s_{ij}</math> denotes the relation between the ith and jth features and is calculated using a radial basis function kernel. After some transformations, this second regularization term, which has tuning parameter <math>\rho_2</math>, is:<br />
<br />
$$\begin{aligned}<br />
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})<br />
\end{aligned}$$<br />
<br />
where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features. The Laplacian matrix, also called the graph Laplacian, is a matrix representation of a graph. <br />
<br />
This gives a final objective function of:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||_1 + \rho_2R(\textbf{W})<br />
\end{aligned}$$<br />
<br />
Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.<br />
<br />
==== Calculate ''k'' for training set ====<br />
<br />
Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training samples. If a value is 0, it can be concluded that the jth training sample has no effect on the ith training sample, which means it should not be used in the prediction of the ith training sample. Consequently, all training samples with non-zero correlations to a given sample are useful in predicting it, so the number of non-zero elements in each sample's correlation vector equals the optimal ''k'' value for that sample.<br />
<br />
For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:<br />
<br />
[[File:Approach_Figure_2.png | center | 300x300px]]<br />
<br />
The optimal ''k'' value for training sample 1 would be 2, since the correlations between training sample 1 and training samples 2 and 4 are non-zero.<br />
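This counting step can be sketched directly; the matrix below is an illustrative stand-in for a learned <math>\textbf{W*}</math>, not the paper's data:

```python
import numpy as np

# Each row i of W_star holds sample i's correlations with the other samples;
# the optimal k for sample i is simply its number of non-zero entries.
W_star = np.array([
    [0.0, 0.3, 0.0, 0.2],   # sample 1 depends on samples 2 and 4 -> k = 2
    [0.1, 0.0, 0.0, 0.0],   # sample 2 depends on sample 1        -> k = 1
    [0.0, 0.5, 0.0, 0.4],   # sample 3 depends on samples 2 and 4 -> k = 2
    [0.2, 0.0, 0.3, 0.0],   # sample 4 depends on samples 1 and 3 -> k = 2
])
optimal_k = np.count_nonzero(W_star, axis=1)  # one optimal k per training sample
```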
<br />
==== Train a Decision Tree using ''k'' as the label ====<br />
<br />
In a normal decision tree, the target data is the labels themselves. In contrast, in the kTree method, the target data is the optimal ''k'' value for each sample that was solved for in the previous step. So this decision tree has the following form:<br />
<br />
[[File:Approach_Figure_3.png | center | 300x300px]]<br />
<br />
==== Making Predictions for Test Data ====<br />
<br />
The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbors across '''all''' of the training data.<br />
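This final step (a per-sample ''k'', then a standard majority vote over all training data) can be sketched as follows; the function and variable names are illustrative, and the per-sample k values are taken as given here since they would come from querying the trained kTree:

```python
import numpy as np
from collections import Counter

def predict_with_varied_k(X_train, y_train, X_test, k_values):
    """Label each test sample by the majority class of its own k nearest
    training samples, where k_values[i] is the kTree's output for test sample i."""
    labels = []
    for x, k in zip(X_test, k_values):
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        labels.append(Counter(y_train[i] for i in nearest).most_common(1)[0][0])
    return np.array(labels)
```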
<br />
=== k*Tree Classification ===<br />
<br />
The proposed k*Tree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_4.png | center | 1000x1000px]]<br />
<br />
Clearly, this is a very similar approach to the kTree as the k*Tree method attempts to sacrifice very little in predictive power in return for a substantial decrease in complexity when actually implementing the traditional kNN on the testing data once the optimal ''k'' values have been found.<br />
<br />
While all previous steps are identical, the difference comes from the additional data stored in the leaf nodes. The k*Tree method stores not only the optimal ''k'' value but also the following information:<br />
<br />
* The training samples that have the same optimal ''k''<br />
* The ''k'' nearest neighbours of the previously identified training samples<br />
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours<br />
<br />
The data stored in each node is summarized in the following figure:<br />
<br />
[[File:Approach_Figure_5.png | center | 800x800px]]<br />
<br />
When testing, the constructed k*Tree is searched for a test sample's optimal ''k'' value as well as its nearest neighbours in the leaf node. It then selects that number of nearest neighbours from the stored subset of training samples and assigns the test sample the majority label of these nearest neighbours.<br />
<br />
In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.<br />
<br />
== Experiments == <br />
<br />
In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, and contained a mix of data types varying in size, in dimensionality, in the number of classes, and in imbalanced nature of the data. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:<br />
<br />
# k-Nearest Neighbor: The classical kNN approach with k set to 1, 5, 10, 20, and the square root of the sample size [9]; the best result was reported.<br />
# kNN-Based Applicability Domain Approach (AD-kNN) [11]<br />
# kNN Method Based on Sparse Learning (S-kNN) [10]<br />
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]<br />
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]<br />
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]<br />
<br />
The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features. <br />
<br />
<br />
'''A. Experimental Results on Different Sample Sizes'''<br />
<br />
The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.<br />
<br />
[[File:Table_I_kNN.png | center | 1000x1000px]]<br />
<br />
The following key results are noted:<br />
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-KNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.<br />
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.<br />
* The proposed k*Tree method had the lowest running cost of all methods. However, the k*Tree method was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and its running cost was lower than that of any other method except the k*Tree method.<br />
<br />
<br />
'''B. Experimental Results on Different Feature Numbers'''<br />
<br />
The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score, an algorithm that solves maximum likelihood equations numerically [15], was used to rank and select the most informative features in the datasets. <br />
<br />
[[File:Table_II_kNN.png | center | 1000x1000px]]<br />
<br />
From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-KNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.<br />
<br />
== Conclusion == <br />
<br />
This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods can classify the test samples efficiently and effectively, by designing a training step that reduces the run time of the test stage and thus enhances the performance. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features.<br />
<br />
== Critiques == <br />
<br />
*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy. <br />
* The authors addressed that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.). <br />
*While the authors contrast their kTree and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compared to Góra and Wojna's varied optimal k method.<br />
<br />
* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.<br />
<br />
* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.<br />
<br />
* It would be really helpful if the kTree method were explained at the very beginning; the transition from kNN to kTree is not very smooth.<br />
<br />
* It would be nice to have a comparison of the running costs of different methods to see how much faster kTree and k*Tree performed<br />
<br />
* It would be better to show only the key results in a summary rather than stacking up all results without screening.<br />
<br />
* In the results section, it was mentioned that in the experiment on data sets with different numbers of features, the kTree and k*Tree model did not achieve GS-kNN or S-kNN's accuracies, but was faster in terms of running cost. It might be helpful here if the authors add some more supporting arguments about the benefit of this tradeoff, which appears to be a minor decrease in accuracy for a large improvement in speed. This could further showcase the advantages of the kTree and k*Tree models. More quantitative analysis or real-life scenario examples could be some choices here.<br />
<br />
* An interesting thing to notice while solving for the optimal matrix <math>W^*</math> that minimizes the loss function is that <math>W^*</math> is not necessarily a symmetric matrix. That is, the correlation between the <math>i^{th}</math> entry and the <math>j^{th}</math> entry is different from that between the <math>j^{th}</math> entry and the <math>i^{th}</math> entry, which makes the resulting W* not really semantically meaningful. Therefore, it would be interesting if we may set a threshold on the allowing difference between the <math>ij^{th}</math> entry and the <math>ji^{th}</math> entry in <math>W^*</math> and see if this new configuration will give better or worse results compared to current ones, which will provide better insights of the algorithm.<br />
<br />
* It would be interesting to see how the proposed model works with highly non-linear datasets. In the event it does not work well, it would pose the question: would replacing the k*Tree with a SVM or a neural network improve the accuracy? There could be experiments to show if this variant would prove superior over the original models.<br />
<br />
* The key results are a little misleading. For example, the claim that "the kTree had the highest accuracy and its running cost was lower than any other method except the k*Tree method" is false: the kTree method had slightly lower accuracy than both GS-kNN and S-kNN, and kTree was also slower than LC-kNN.<br />
<br />
== References == <br />
<br />
[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.<br />
<br />
[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.<br />
<br />
[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.<br />
<br />
[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.<br />
<br />
[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.<br />
<br />
[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.<br />
<br />
[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.<br />
<br />
[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.<br />
<br />
[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.<br />
<br />
[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.<br />
<br />
[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013. <br />
<br />
[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.<br />
<br />
[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.<br />
<br />
[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.<br />
<br />
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_Probability&diff=49058Being Bayesian about Categorical Probability2020-12-04T01:04:10Z<p>Z42qin: /* Classification With a Neural Network */</p>
<hr />
<div>== Presented By ==<br />
Evan Li, Jason Pu, Karam Abuaisha, Nicholas Vadivelu<br />
<br />
== Introduction ==<br />
<br />
Since the outputs of neural networks are not probabilities, Softmax (Bridle, 1990) is a staple for neural networks performing classification: it exponentiates each logit and then normalizes by the sum, giving a distribution over the target classes. However, networks with softmax outputs give no information about uncertainty (Blundell et al., 2015; Gal & Ghahramani, 2016), and the resulting distribution over classes is poorly calibrated (Guo et al., 2017), often giving overconfident predictions even when the classification is wrong. In addition, softmax also raises concerns about overfitting NNs due to its confident predictive behavior (Xie et al., 2016; Pereyra et al., 2017). Achieving better generalization performance may therefore require some effective regularization techniques. <br />
<br />
Bayesian Neural Networks (BNNs; MacKay, 1992) can alleviate these issues, but the resulting posteriors over the parameters are often intractable. Approximations such as variational inference (Graves, 2011; Blundell et al., 2015) and Monte Carlo Dropout (Gal & Ghahramani, 2016) can still be expensive or give poor estimates for the posteriors. This work proposes a Bayesian treatment of the output logits of the neural network, treating the targets as a categorical random variable instead of a fixed label. This gives a computationally cheap way to get well-calibrated uncertainty estimates on neural network classifications.<br />
<br />
== Related Work ==<br />
<br />
Using Bayesian Neural Networks is the dominant way of applying Bayesian techniques to neural networks. Many techniques have been developed to make posterior approximation more accurate and scalable; despite these, BNNs do not scale to state-of-the-art architectures or large data sets. There are more scalable techniques that explicitly avoid modeling the full weight posterior, such as Monte Carlo Dropout (Gal & Ghahramani, 2016) or tracking the mean/covariance of the posterior during training (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019). Non-Bayesian uncertainty estimation techniques include deep ensembles (Lakshminarayanan et al., 2017) and temperature scaling (Guo et al., 2017; Neumann et al., 2018).<br />
<br />
== Preliminaries ==<br />
=== Definitions ===<br />
Let's formalize our classification problem and define some notations for the rest of this summary:<br />
<br />
::Dataset:<br />
$$ \mathcal D = \{(x_i,y_i)\} \in (\mathcal X \times \mathcal Y)^N $$<br />
::General classification model<br />
$$ f^W: \mathcal X \to \mathbb R^K $$<br />
::Softmax function: <br />
$$ \phi(x): \mathbb R^K \to [0,1]^K \;\;|\;\; \phi_k(X) = \frac{\exp(f_k^W(x))}{\sum_{k \in K} \exp(f_k^W(x))} $$<br />
::Softmax activated NN:<br />
$$ \phi \;\circ\; f^W: \chi \to \Delta^{K-1} $$<br />
::NN as a true classifier:<br />
$$ arg\max_i \;\circ\; \phi_i \;\circ\; f^W \;:\; \mathcal X \to \mathcal Y $$<br />
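The softmax map defined above is straightforward to implement; a minimal, numerically stable sketch:

```python
import numpy as np

def softmax(logits):
    """phi: R^K -> simplex. Subtracting the max logit before exponentiating
    prevents overflow and leaves the result unchanged."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()
```

Subtracting the maximum is safe because softmax is invariant to adding a constant to all logits.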
<br />
We'll also define the '''count function''' - a <math>K</math>-vector valued function that outputs the occurences of each class coincident with <math>x</math>:<br />
$$ c^{\mathcal D}(x) = \sum_{(x',y') \in \mathcal D} y' I(x' = x) $$<br />
<br />
=== Classification With a Neural Network ===<br />
A typical loss function used in classification is cross-entropy. It's well known that optimizing <math>f^W</math> for <math>l_{CE}</math> is equivalent to optimizing for <math>l_{KL}</math>, the <math>KL</math> divergence between the true distribution and the distribution modeled by NN, that is:<br />
$$ l_{KL}(W) = KL(\text{true distribution} \;|\; \text{distribution encoded by }NN(W)) $$<br />
Let's introduce notations for the underlying (true) distributions of our problem. Let <math>(x_0,y_0) \sim (\mathcal X \times \mathcal Y)</math>:<br />
$$ \text{Full Distribution} = F(x,y) = P(x_0 = x,y_0 = y) $$<br />
$$ \text{Marginal Distribution} = P(x) = F(x_0 = x) $$<br />
$$ \text{Point Class Distribution} = P(y_0 = y \;|\; x_0 = x) = F_x(y) $$<br />
Then we have the following factorization:<br />
$$ F(x,y) = P(x,y) = P(y|x)P(x) = F_x(y)F(x) $$<br />
Substitute this into the definition of KL divergence:<br />
$$ = \sum_{(x,y) \in \mathcal X \times \mathcal Y} F(x,y) \log\left(\frac{F(x,y)}{\phi_y(f^W(x))}\right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F(y|x) \log\left( \frac{F(y|x)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F_x(y) \log\left( \frac{F_x(y)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) KL(F_x \;||\; \phi\left( f^W(x) \right)) $$<br />
As usual, we don't have an analytic form for <math>l</math> (if we did, it would imply we knew <math>F_x</math>, meaning we knew the distribution in the first place). Instead, we estimate it from <math>\mathcal D</math>:<br />
$$ F(x) \approx \hat F(x) = \frac{||c^{\mathcal D}(x)||_1}{N} $$<br />
$$ F_x(y) \approx \hat F_x(y) = \frac{c^{\mathcal D}(x)}{|| c^{\mathcal D}(x) ||_1}$$<br />
$$ \to l_{KL}(W) = \sum_{x \in \mathcal D} \frac{||c^{\mathcal D}(x)||_1}{N} KL \left( \frac{c^{\mathcal D}(x)}{||c^{\mathcal D}(x)||_1} \;||\; \phi(f^W(x)) \right) $$<br />
The approximations <math>\hat F, \hat F_x</math> are often not very good though: consider a typical classification task such as MNIST; we would never expect two handwritten digits to produce the exact same image. Hence <math>c^{\mathcal D}(x)</math> is (almost) always going to have a single index equal to 1 and the rest 0. This has implications for our approximations:<br />
$$ \hat F(x) \text{ is uniform for all } x \in \mathcal D $$<br />
$$ \hat F_x(y) \text{ is degenerate for all } x \in \mathcal D $$<br />
This clearly has implications for overfitting: to minimize the KL term in <math>l_{KL}(W)</math> we want <math>\phi(f^W(x))</math> to be very close to <math>\hat F_x(y)</math> at each point - this means that the loss function is in fact encouraging the neural network to output near degenerate distributions! <br />
<br />
'''Label Smoothing'''<br />
One form of regularization that helps with this problem is called label smoothing. Instead of using the degenerate <math>\hat F_x(y)</math> as a target function, we "smooth" it (by adding a scaled uniform distribution to it) so it is no longer degenerate:<br />
$$ F'_x(y) = (1-\lambda)\hat F_x(y) + \frac \lambda K \vec 1 $$<br />
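The smoothing step above has a one-line implementation; a minimal sketch:

```python
import numpy as np

def smooth_label(one_hot, lam):
    """F'_x(y) = (1 - lam) * F_hat_x(y) + (lam / K) * 1: mix the degenerate
    one-hot target with a uniform distribution so no class has probability zero."""
    K = len(one_hot)
    return (1.0 - lam) * one_hot + (lam / K) * np.ones(K)
```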
<br />
'''BNNs'''<br />
BNNs balance the complexity of the model and the distance to the target distribution without choosing a single best configuration (one-hot encoding). Specifically, BNNs with the Gaussian weight prior <math>p_W(W) = \mathcal N(0, T^{-1} I)</math> measure the score of a configuration <math>W</math> by the posterior density <math>p_W(W|\mathcal D) \propto p(\mathcal D|W)p_W(W)</math>, where <math>\log p_W(W) \propto -T||W||^2_2</math>.<br />
Here, <math>||W||^2_2</math> can be a poor proxy for penalizing model complexity due to its linear nature.<br />
<br />
== Method ==<br />
The main technical proposal of the paper is a Bayesian framework to estimate the (former) target distribution <math>F_x(y)</math>. That is, we construct a posterior distribution of <math> F_x(y) </math> and use that as our new target distribution. We call it the ''belief matching'' (BM) framework.<br />
<br />
=== Constructing Target Distribution ===<br />
Recall that <math>F_x(y)</math> is a k-categorical probability distribution: its PMF can be fully characterized by k numbers that sum to 1. Hence we can encode any such <math>F_x</math> as a point in <math>\Delta^{k-1}</math>. We'll do exactly that; let's call this vector <math>z</math>:<br />
$$ z \in \Delta^{k-1} $$<br />
$$ \text{prior} = p_{z|x}(z) $$<br />
$$ \text{conditional} = p_{y|z,x}(y) $$<br />
$$ \text{posterior} = p_{z|x,y}(z) $$<br />
Then if we perform inference:<br />
$$ p_{z|x,y}(z) \propto p_{z|x}(z)p_{y|z,x}(y) $$<br />
The distribution chosen to model prior was <math>dir_K(\beta)</math>:<br />
$$ p_{z|x}(z) = \frac{\Gamma(||\beta||_1)}{\prod_{k=1}^K \Gamma(\beta_k)} \prod_{k=1}^K z_k^{\beta_k - 1} $$<br />
Note that by definition of <math>z</math>: <math> p_{y|x,z} = z_y </math>. Since the Dirichlet is a conjugate prior to categorical distributions we have a convenient form for the mean of the posterior:<br />
$$ \bar{p_{z|x,y}}(z) = \frac{\beta + c^{\mathcal D}(x)}{||\beta + c^{\mathcal D}(x)||_1} \propto \beta + c^{\mathcal D}(x) $$<br />
This is in fact a generalization of (uniform) label smoothing (label smoothing is a special case where <math>\beta = \frac 1 K \vec{1} </math>).<br />
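Because of conjugacy, the posterior mean is just the normalized sum of the prior concentration and the counts; a sketch:

```python
import numpy as np

def dirichlet_posterior_mean(beta, counts):
    """Mean of the Dirichlet posterior: (beta + c(x)) / ||beta + c(x)||_1.
    With beta = (1/K) * ones, this reduces to uniform label smoothing."""
    a = beta + counts
    return a / a.sum()
```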
<br />
=== Representing Approximate Distribution ===<br />
Our new target distribution is <math>p_{z|x,y}(z)</math> (as opposed to <math>F_x(y)</math>). That is, we want to interpret our neural network weights as encoding a distribution with support in <math> \Delta^{K-1} </math>; the NN can then be trained so this encoded distribution closely approximates <math>p_{z|x,y}</math>. Let's denote the density of this encoded distribution <math>q_{z|x}^W</math>. This is how the BM framework defines it:<br />
$$ \alpha^W(x) := \exp(f^W(x)) $$<br />
$$ q_{z|x}^W(z) = \frac{\Gamma(||\alpha^W(x)||_1)}{\prod_{k=1}^K \Gamma(\alpha_k^W(x))} \prod_{k=1}^K z_{k}^{\alpha_k^W(x) - 1} $$<br />
$$ \to Z^W_x \sim dir(\alpha^W(x)) $$<br />
Apply <math>\log</math> then <math>\exp</math> to <math>q_{z|x}^W</math>:<br />
$$ q^W_{z|x}(z) \propto \exp \left( \sum_k (\alpha_k^W(x) \log(z_k)) - \sum_k \log(z_k) \right) $$<br />
$$ \propto -l_{CE}(\phi(f^W(x)),z) + \frac{K}{||\alpha^W(x)||}KL(\mathcal U_k \;||\; z) $$<br />
It can actually be shown that the mean of <math>Z_x^W</math> is identical to <math>\phi(f^W(x))</math> - in other words, if we output the mean of the encoded distribution of our neural network under the BM framework, it is theoretically identical to a traditional neural network.<br />
<br />
=== Distribution Matching ===<br />
<br />
We now need a way to fit our approximate distribution from our neural network <math>q_{\mathbf{z | x}}^{\mathbf{W}}</math> to our target distribution <math>p_{\mathbf{z|x},y}</math>. The authors achieve this by maximizing the evidence lower bound (ELBO):<br />
<br />
$$l_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) $$<br />
<br />
Each term can be computed analytically:<br />
<br />
$$\mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf W }} \left[\log z_y \right] = \psi(\alpha_y^{\mathbf W} ( \mathbf x )) - \psi(\alpha_0^{\mathbf W} ( \mathbf x )) $$<br />
<br />
Where <math>\psi(\cdot)</math> represents the digamma function (logarithmic derivative of gamma function). Intuitively, we maximize the probability of the correct label. For the KL term:<br />
<br />
$$KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) = \log \frac{\Gamma(\alpha_0^{\mathbf W}(\mathbf x)) \prod_k \Gamma(\beta_k)}{\prod_k \Gamma(\alpha_k^{\mathbf W}(x)) \Gamma (\beta_0)} + \sum_k (\alpha_k^{\mathbf W}(x)-\beta_k)(\psi(\alpha_k^{\mathbf W}(\mathbf x)) - \psi(\alpha_0^{\mathbf W}(\mathbf x))) $$<br />
<br />
In the first term, for intuition, we can ignore <math>\alpha_0</math> and <math>\beta_0</math> since those just calibrate the distributions. Otherwise, we want the ratio of the products to be as close to 1 as possible to minimize the KL. In the second term, we want to minimize the difference between each individual <math>\alpha_k</math> and <math>\beta_k</math>, scaled by the normalized output of the neural network. <br />
<br />
This loss function can be used as a drop-in replacement for the standard softmax cross-entropy, as it has an analytic form and the same time complexity as typical softmax-cross entropy with respect to the number of classes (<math>O(K)</math>).<br />
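Given the closed forms above, the loss needs only gamma-function machinery. The sketch below hand-rolls a digamma approximation to stay dependency-free; it illustrates the analytic ELBO terms and is not the authors' implementation:

```python
import math

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x+1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def belief_matching_loss(alpha, y, beta, lam=1.0):
    """Negative ELBO: -(psi(alpha_y) - psi(alpha_0)) + lam * KL(Dir(alpha) || Dir(beta))."""
    a0, b0 = sum(alpha), sum(beta)
    expected_ll = digamma(alpha[y]) - digamma(a0)   # E_q[log z_y]
    kl = math.lgamma(a0) - math.lgamma(b0)          # closed-form Dirichlet KL
    for a, b in zip(alpha, beta):
        kl += math.lgamma(b) - math.lgamma(a) + (a - b) * (digamma(a) - digamma(a0))
    return -expected_ll + lam * kl
```

When <math>\alpha = \beta</math> the KL term vanishes and the loss reduces to <math>\psi(\alpha_0) - \psi(\alpha_y)</math>; increasing the concentration on the true class lowers the loss.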
<br />
=== On Prior Distributions ===<br />
<br />
We must choose the concentration parameter, <math>\beta</math>, of our Dirichlet prior. The prior essentially disappears as <math>\beta_0 \to 0</math> and becomes stronger as <math>\beta_0 \to \infty</math>. Thus, we want a small <math>\beta_0</math> so the posterior isn't dominated by the prior. However, the authors note that a small <math>\beta_0</math> makes <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small, which causes <math>\psi (\alpha_0^{\mathbf W}(\mathbf x))</math> to be large in magnitude, which is problematic for gradient-based optimization. In practice, many neural network techniques aim to make <math>\mathbb E [f^{\mathbf W} (\mathbf x)] \approx \mathbf 0</math> and thus <math>\mathbb E [\alpha^{\mathbf W} (\mathbf x)] \approx \mathbf 1</math>, so forcing <math>\alpha_0^{\mathbf W}(\mathbf x)</math> to be small can be counterproductive.<br />
<br />
So, the authors set <math>\beta = \mathbf 1</math> and introduce a new hyperparameter <math>\lambda</math> which is multiplied with the KL term in the ELBO:<br />
<br />
$$l^\lambda_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - \lambda KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; \mathcal P^D (\mathbf 1)) $$<br />
<br />
This stabilizes the optimization, as we can tell from the gradients:<br />
<br />
$$\frac{\partial l_{E B}\left(\mathbf{y}, \alpha^{\mathbf W}(\mathbf{x})\right)}{\partial \alpha_{k}^{\mathbf W}(\mathbf {x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\alpha_{k}^{\mathbf W}(\mathbf{x})-\beta_{k}\right)\right) \psi^{\prime}\left(\alpha_{k}^{\mathbf{W}}(\boldsymbol{x})\right)<br />
-\left(1-\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})-\beta_{0}\right)\right) \psi^{\prime}\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})\right)$$<br />
<br />
$$\frac{\partial l_{E B}^{\lambda}\left(\mathbf{y}, \alpha^{\mathbf{W}}(\mathbf{x})\right)}{\partial \alpha_{k}^{W}(\mathbf{x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})-\lambda\right)\right) \frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}<br />
-\left(1-\left(\tilde{\alpha}_{0}^{W}(\mathbf{x})-\lambda K\right)\right)$$<br />
<br />
As we can see, the first expression is affected by the magnitude of <math>\alpha^{\boldsymbol{W}}(\boldsymbol{x})</math>, whereas the second expression is not due to the <math>\frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}</math> ratio.<br />
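The first gradient expression can be checked against a finite-difference approximation of <math>l_{EB}</math>. A small sketch (<math>\tilde{\mathbf y}</math> is the one-hot label; the <math>\alpha</math> values are arbitrary):

```python
import numpy as np
from scipy.special import digamma, gammaln, polygamma

def l_eb(y, alpha, beta):
    # l_EB = E_q[log z_y] - KL( Dir(alpha) || Dir(beta) )
    a0, b0 = alpha.sum(), beta.sum()
    kl = (gammaln(a0) + gammaln(beta).sum() - gammaln(alpha).sum() - gammaln(b0)
          + ((alpha - beta) * (digamma(alpha) - digamma(a0))).sum())
    return digamma(alpha[y]) - digamma(a0) - kl

y, alpha, beta = 1, np.array([2.0, 3.0, 0.7]), np.ones(3)
y_hot = np.eye(3)[y]
psi1 = lambda a: polygamma(1, a)   # trigamma = psi'

# Analytic gradient from the first expression above
a0, b0 = alpha.sum(), beta.sum()
grad = ((y_hot - (alpha - beta)) * psi1(alpha)
        - (1 - (a0 - b0)) * psi1(a0))

# Central finite differences
eps = 1e-6
num = np.array([(l_eb(y, alpha + eps * np.eye(3)[k], beta)
                 - l_eb(y, alpha - eps * np.eye(3)[k], beta)) / (2 * eps)
                for k in range(3)])
print(np.max(np.abs(grad - num)))  # tiny: analytic and numeric gradients match
```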
<br />
== Experiments ==<br />
<br />
Throughout the experiments in this paper, the authors employ various models based on residual connections (He et al., 2016), which are standard benchmark models in practice. We will first demonstrate the improvements provided by BM, then show its versatility in other applications. For fairness of comparison, all configurations in the reference implementation are fixed. The only additions in the experiments are an initial learning-rate warm-up and gradient clipping, which are extremely helpful for stable training of BM. <br />
<br />
=== Generalization performance === <br />
The paper compares the generalization performance of BM with softmax and MC dropout on CIFAR-10 and CIFAR-100 benchmarks.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T1.png]]<br />
<br />
The next comparison was performed between BM and softmax on the ImageNet benchmark. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T2.png]]<br />
<br />
For both datasets and in all configurations, BM achieves the best generalization, outperforming both softmax and MC dropout.<br />
<br />
===== Regularization effect of prior =====<br />
<br />
In theory, BM has two regularization effects:<br />
* The prior distribution, which smooths the target posterior.<br />
* Averaging over all possible categorical probabilities when computing the distribution-matching loss.<br />
The authors perform an ablation study to examine the two effects separately: removing the KL term in the ELBO removes the effect of the prior distribution.<br />
For ResNet-50 on CIFAR-100 and CIFAR-10, the resulting test error rates were 24.69% and 5.68% respectively. <br />
<br />
This demonstrates that both regularization effects are significant since just having one of them improves the generalization performance compared to the softmax baseline, and having both improves the performance even more.<br />
<br />
===== Impact of <math>\beta</math> =====<br />
<br />
The effect of β on generalization performance is studied by training ResNet-18 on CIFAR-10, tuning the value of β on its own as well as jointly with λ. It was found that robust generalization performance is obtained for β ∈ [<math>e^{−1}, e^4</math>] when tuning β on its own, and for β ∈ [<math>e^{−4}, e^{8}</math>] when tuning β jointly with λ. The figure below shows a plot of the error rate with varying β.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F3.png]]<br />
<br />
=== Uncertainty Representation ===<br />
<br />
One of the big advantages of BM is the ability to represent uncertainty about the prediction. The authors evaluate the uncertainty representation on in-distribution (ID) and out-of-distribution (OOD) samples. <br />
<br />
===== ID uncertainty =====<br />
<br />
For ID (in-distribution) samples, calibration performance is measured, which is a measure of how well the model’s confidence matches its actual accuracy. This measure can be visualized using reliability plots and quantified using a metric called expected calibration error (ECE). ECE is calculated by grouping predictions into M groups based on their confidence score and then finding the absolute difference between the average accuracy and average confidence for each group.<br />
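The ECE computation described here can be sketched in a few lines of numpy (the confidences and correctness indicators are toy values; 15 bins as in the paper's plots):

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=15):
    """Bin predictions by confidence; ECE is the bin-size-weighted mean
    absolute gap between average confidence and average accuracy per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            gap = abs(confidence[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

conf = np.array([0.9, 0.8, 0.95, 0.6, 0.7])      # toy confidence scores
correct = np.array([1.0, 1.0, 0.0, 1.0, 0.0])    # 1 if prediction was right
print(expected_calibration_error(conf, correct))
```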
The figure below is a reliability plot of ResNet-50 on CIFAR-10 and CIFAR-100 with 15 groups. It shows that BM has a significantly better calibration performance than softmax since the confidence matches the accuracy more closely (this is also reflected in the lower ECE).<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F4.png]]<br />
<br />
===== OOD uncertainty =====<br />
<br />
Here, the authors quantify uncertainty using predictive entropy - the larger the predictive entropy, the larger the uncertainty about a prediction. <br />
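Predictive entropy is just the Shannon entropy of the predictive class probabilities; a minimal sketch (the probability vectors are illustrative):

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """H(p) = -sum_k p_k log p_k over the class axis."""
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

confident = np.array([0.98, 0.01, 0.005, 0.005])   # low uncertainty
uniform = np.full(4, 0.25)                         # maximal uncertainty
print(predictive_entropy(confident), predictive_entropy(uniform))
```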
<br />
The figure below is a density plot of the predictive entropy of ResNet-50 on CIFAR-10. It shows that BM provides significantly better uncertainty estimation compared to other methods since BM is the only method that has a clear peak of high predictive entropy for OOD samples which should have high uncertainty. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F5.png]]<br />
<br />
=== Transfer learning ===<br />
<br />
Belief matching applies the Bayesian principle outside the neural network, which means it can easily be applied to already trained models. Thus, belief matching can be employed in transfer learning scenarios. The authors downloaded the ImageNet pre-trained ResNet-50 weights and fine-tuned the weights of the last linear layer for 100 epochs using an Adam optimizer.<br />
<br />
This table shows the test error rates from transfer learning on CIFAR-10, Food-101, and Cars datasets. Belief matching consistently performs better than softmax. <br />
<br />
[[File:being_bayesian_about_categorical_probability_transfer_learning.png]]<br />
<br />
Belief matching was also tested on the predictive uncertainty for out-of-dataset samples, with CIFAR-10 as the in-distribution sample. Looking at the figure below, it is observed that belief matching significantly improves the uncertainty representation of pre-trained models by fine-tuning only the last layer's weights. Note that belief matching confidently predicts examples in Cars, since CIFAR-10 contains the object category automobiles. In comparison, softmax produces confident predictions on all datasets. Thus, belief matching can also enhance the uncertainty representation of pre-trained models without sacrificing their generalization performance.<br />
<br />
[[File: being_bayesian_about_categorical_probability_transfer_learning_uncertainty.png]]<br />
<br />
=== Semi-Supervised Learning ===<br />
<br />
Belief matching’s ability to let neural networks represent rich information in their predictions can be exploited to aid consistency-based loss functions for semi-supervised learning. Consistency-based loss functions use unlabelled samples to promote robustness of predictions under stochastic perturbations. This can be done by perturbing the inputs (as in the VAT model) or the networks (as in the Π-model). Both methods minimize the divergence between two categorical probabilities under some perturbation, so belief matching can be used by making the following replacements in the loss functions. The hope is that belief matching can provide better prediction consistency using its Dirichlet distributions.<br />
<br />
[[File: being_bayesian_about_categorical_probability_semi_supervised_equation.png]]<br />
<br />
The results of training ResNet28-2 with consistency-based loss functions on CIFAR-10 are shown in this table. Belief matching achieves lower classification error rates than softmax.<br />
<br />
[[File:being_bayesian_about_categorical_probability_semi_supervised_table.png]]<br />
<br />
== Conclusion ==<br />
<br />
Bayesian principles can be used to construct the target distribution by using the categorical probability as a random variable rather than a training label. This can be applied to neural network models by replacing only the softmax and cross-entropy loss, while improving the generalization performance and uncertainty estimation. <br />
<br />
In the future, the authors would like to allow for more expressive distributions in the belief matching framework, such as logistic normal distributions to capture strong semantic similarities among class labels. Furthermore, using input dependent priors would allow for interesting properties that would aid imbalanced datasets and multi-domain learning.<br />
<br />
== Citations ==<br />
<br />
[1] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Springer, 1990.<br />
<br />
[2] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.<br />
<br />
[3] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.<br />
<br />
[4] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. <br />
<br />
[5] MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448– 472, 1992.<br />
<br />
[6] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011. <br />
<br />
[7] Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1):4873–4907, 2017.<br />
<br />
[8] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference of Machine Learning, 2018.<br />
<br />
[9] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[10] Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[11] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.<br />
<br />
[12] Neumann, L., Zisserman, A., and Vedaldi, A. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2018.<br />
<br />
[13] Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. Disturblabel: Regularizing cnn on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.<br />
<br />
[14] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.</div>
Z42qin
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_Probability&diff=49056
Being Bayesian about Categorical Probability
2020-12-04T01:01:49Z
<p>Z42qin: /* Classification With a Neural Network */</p>
<hr />
<div>== Presented By ==<br />
Evan Li, Jason Pu, Karam Abuaisha, Nicholas Vadivelu<br />
<br />
== Introduction ==<br />
<br />
Since the raw outputs of neural networks are not probabilities, softmax (Bridle, 1990) is a staple for neural networks performing classification: it exponentiates each logit, then normalizes by the sum, giving a distribution over the target classes. However, networks with softmax outputs give no information about uncertainty (Blundell et al., 2015; Gal & Ghahramani, 2016), and the resulting distribution over classes is poorly calibrated (Guo et al., 2017), often giving overconfident predictions even when the classification is wrong. Softmax also raises concerns about overfitting, due to its confident predictive behaviour (Xie et al., 2016; Pereyra et al., 2017), so achieving good generalization performance may require effective regularization techniques. <br />
<br />
Bayesian Neural Networks (BNNs; MacKay, 1992) can alleviate these issues, but the resulting posteriors over the parameters are often intractable. Approximations such as variational inference (Graves, 2011; Blundell et al., 2015) and Monte Carlo Dropout (Gal & Ghahramani, 2016) can still be expensive or give poor estimates for the posteriors. This work proposes a Bayesian treatment of the output logits of the neural network, treating the targets as a categorical random variable instead of a fixed label. This gives a computationally cheap way to get well-calibrated uncertainty estimates on neural network classifications.<br />
<br />
== Related Work ==<br />
<br />
Using Bayesian Neural Networks is the dominant way of applying Bayesian techniques to neural networks. Many techniques have been developed to make posterior approximation more accurate and scalable; despite these, BNNs do not scale to state-of-the-art architectures or large data sets. More scalable techniques explicitly avoid modeling the full weight posterior, such as Monte Carlo Dropout (Gal & Ghahramani, 2016) or tracking the mean/covariance of the posterior during training (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019). Non-Bayesian uncertainty estimation techniques include deep ensembles (Lakshminarayanan et al., 2017) and temperature scaling (Guo et al., 2017; Neumann et al., 2018).<br />
<br />
== Preliminaries ==<br />
=== Definitions ===<br />
Let's formalize our classification problem and define some notations for the rest of this summary:<br />
<br />
::Dataset:<br />
$$ \mathcal D = \{(x_i,y_i)\} \in (\mathcal X \times \mathcal Y)^N $$<br />
::General classification model<br />
$$ f^W: \mathcal X \to \mathbb R^K $$<br />
::Softmax function: <br />
$$ \phi: \mathbb R^K \to \Delta^{K-1} \;\;|\;\; \phi_k(f^W(x)) = \frac{\exp(f_k^W(x))}{\sum_{j=1}^{K} \exp(f_j^W(x))} $$<br />
::Softmax activated NN:<br />
$$ \phi \;\circ\; f^W: \chi \to \Delta^{K-1} $$<br />
::NN as a true classifier:<br />
$$ arg\max_i \;\circ\; \phi_i \;\circ\; f^W \;:\; \mathcal X \to \mathcal Y $$<br />
<br />
We'll also define the '''count function''' - a <math>K</math>-vector valued function that outputs the occurrences of each class coincident with <math>x</math> (here <math>\mathbf y'</math> denotes the one-hot encoding of <math>y'</math>):<br />
$$ c^{\mathcal D}(x) = \sum_{(x',y') \in \mathcal D} \mathbf{y}' \, \mathbb I(x' = x) $$<br />
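These definitions translate directly into code. A small numpy sketch (the toy dataset is made up, and labels are stored as class indices so the count function is a straight tally):

```python
import numpy as np

def softmax(logits):
    """phi: R^K -> Delta^{K-1}, numerically stabilized by shifting the max."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def count(dataset, x, num_classes):
    """c^D(x): K-vector of class occurrences coincident with x."""
    c = np.zeros(num_classes)
    for x_i, y_i in dataset:
        if x_i == x:
            c[y_i] += 1          # y_i is a class index here
    return c

data = [("a", 0), ("a", 0), ("a", 1), ("b", 2)]   # toy dataset
print(count(data, "a", 3))       # occurrences of each class at x = "a"
print(softmax(np.array([2.0, 0.0, -1.0])))
```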
<br />
=== Classification With a Neural Network ===<br />
A typical loss function used in classification is cross-entropy. It's well known that optimizing <math>f^W</math> for <math>l_{CE}</math> is equivalent to optimizing for <math>l_{KL}</math>, the <math>KL</math> divergence between the true distribution and the distribution modeled by NN, that is:<br />
$$ l_{KL}(W) = KL(\text{true distribution} \;|\; \text{distribution encoded by }NN(W)) $$<br />
Let's introduce notation for the underlying (true) distributions of our problem. Let <math>(x_0,y_0)</math> be a random draw from the true distribution over <math>\mathcal X \times \mathcal Y</math>:<br />
$$ \text{Full Distribution: } F(x,y) = P(x_0 = x,y_0 = y) $$<br />
$$ \text{Marginal Distribution: } F(x) = P(x_0 = x) $$<br />
$$ \text{Point Class Distribution: } F_x(y) = P(y_0 = y \;|\; x_0 = x) $$<br />
Then we have the following factorization:<br />
$$ F(x,y) = P(x,y) = P(y|x)P(x) = F_x(y)F(x) $$<br />
Substitute this into the definition of KL divergence:<br />
$$ l_{KL}(W) = \sum_{(x,y) \in \mathcal X \times \mathcal Y} F(x,y) \log\left(\frac{F_x(y)}{\phi_y(f^W(x))}\right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F(y|x) \log\left( \frac{F(y|x)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F_x(y) \log\left( \frac{F_x(y)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) KL(F_x \;||\; \phi\left( f^W(x) \right)) $$<br />
As usual, we don't have an analytic form for <math>l</math> (if we did, this would imply we know <math>F_X</math> meaning we knew the distribution in the first place). Instead, estimate from <math>\mathcal D</math>:<br />
$$ F(x) \approx \hat F(x) = \frac{||c^{\mathcal D}(x)||_1}{N} $$<br />
$$ F_x(y) \approx \hat F_x(y) = \frac{c^{\mathcal D}(x)}{|| c^{\mathcal D}(x) ||_1}$$<br />
$$ \to l_{KL}(W) = \sum_{x \in \mathcal D} \frac{||c^{\mathcal D}(x)||_1}{N} KL \left( \frac{c^{\mathcal D}(x)}{||c^{\mathcal D}(x)||_1} \;||\; \phi(f^W(x)) \right) $$<br />
The approximations <math>\hat F, \hat F_x</math> are often not very good, though: consider a typical classification task such as MNIST - we would never expect two handwritten digits to produce the exact same image. Hence <math>c^{\mathcal D}(x)</math> will (almost) always have a single entry equal to 1 and the rest 0. This has implications for our approximations:<br />
$$ \hat F(x) \text{ is uniform for all } x \in \mathcal D $$<br />
$$ \hat F_x(y) \text{ is degenerate for all } x \in \mathcal D $$<br />
This clearly has implications for overfitting: to minimize the KL term in <math>l_{KL}(W)</math> we want <math>\phi(f^W(x))</math> to be very close to <math>\hat F_x(y)</math> at each point - this means that the loss function is in fact encouraging the neural network to output near degenerate distributions! <br />
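To see the degeneracy concretely, here is a toy check (the "images" are placeholder strings, assumed unique as argued above):

```python
import numpy as np

# Toy "MNIST-like" dataset: every input x is unique, as is typical for images.
data = [("img0", 3), ("img1", 7), ("img2", 3)]
K = 10

for x, y in data:
    c = np.zeros(K)
    for x_i, y_i in data:
        if x_i == x:
            c[y_i] += 1
    F_hat_x = c / c.sum()            # empirical estimate of F_x
    # Each input occurs exactly once, so F_hat_x is a degenerate one-hot vector
    print(x, F_hat_x.max())
```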
<br />
'''Label Smoothing'''<br />
One form of regularization that addresses this problem is label smoothing. Instead of using the degenerate <math>\hat F_x(y)</math> as a target distribution, we "smooth" it (by mixing in a scaled uniform distribution) so it is no longer degenerate:<br />
$$ F'_x(y) = (1-\lambda)\hat F_x(y) + \frac \lambda K \vec 1 $$<br />
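The smoothing formula in one line of numpy (the one-hot target and λ = 0.1 are illustrative choices):

```python
import numpy as np

def smooth(one_hot, lam):
    """F'_x = (1 - lambda) * F_hat_x + (lambda / K) * 1"""
    K = one_hot.shape[0]
    return (1 - lam) * one_hot + lam / K * np.ones(K)

target = np.eye(4)[1]          # degenerate one-hot target
print(smooth(target, 0.1))     # mass moved from class 1 to the other classes
```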
<br />
'''BNNs'''<br />
BNNs balance the complexity of the model against the distance to the target distribution, without committing to a single best weight configuration. Specifically, a BNN with Gaussian weight prior <math>W \sim \mathcal N(\mathbf 0, \tau^{-1} I)</math> scores a configuration <math>W</math> by its posterior density:<br />
$$ p_W(W|\mathcal D) \propto p(\mathcal D|W)\,p_W(W), \qquad \log p_W(W) = -\frac{\tau}{2}\|W\|_2^2 + \text{const} $$<br />
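The Gaussian weight prior's log-density reduces to an L2 penalty on the weights plus a constant; a quick numerical check (τ and W are arbitrary toy values):

```python
import numpy as np
from scipy.stats import multivariate_normal

tau = 2.0
W = np.array([0.5, -1.0, 0.3])
d = W.size

# log density of N(0, tau^{-1} I) evaluated at W
log_prior = multivariate_normal.logpdf(W, mean=np.zeros(d), cov=np.eye(d) / tau)

# ... equals -tau/2 * ||W||^2 plus a constant independent of W
const = 0.5 * d * np.log(tau / (2 * np.pi))
print(log_prior, -0.5 * tau * (W ** 2).sum() + const)
```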
<br />
== Method ==<br />
The main technical proposal of the paper is a Bayesian framework to estimate the (former) target distribution <math>F_x(y)</math>. That is, we construct a posterior distribution of <math> F_x(y) </math> and use that as our new target distribution. We call it the ''belief matching'' (BM) framework.<br />
<br />
=== Constructing Target Distribution ===<br />
Recall that <math>F_x(y)</math> is a <math>K</math>-categorical probability distribution - its PMF can be fully characterized by <math>K</math> numbers that sum to 1. Hence we can encode any such <math>F_x</math> as a point in <math>\Delta^{K-1}</math>. We'll do exactly that - let's call this vector <math>z</math>:<br />
$$ z \in \Delta^{k-1} $$<br />
$$ \text{prior} = p_{z|x}(z) $$<br />
$$ \text{conditional} = p_{y|z,x}(y) $$<br />
$$ \text{posterior} = p_{z|x,y}(z) $$<br />
Then if we perform inference:<br />
$$ p_{z|x,y}(z) \propto p_{z|x}(z)p_{y|z,x}(y) $$<br />
The prior is modeled as a Dirichlet distribution, <math>\text{Dir}_K(\beta)</math>:<br />
$$ p_{z|x}(z) = \frac{\Gamma(||\beta||_1)}{\prod_{k=1}^K \Gamma(\beta_k)} \prod_{k=1}^K z_k^{\beta_k - 1} $$<br />
Note that by definition of <math>z</math>: <math> p_{y|x,z} = z_y </math>. Since the Dirichlet is a conjugate prior to categorical distributions we have a convenient form for the mean of the posterior:<br />
$$ \bar{p_{z|x,y}}(z) = \frac{\beta + c^{\mathcal D}(x)}{||\beta + c^{\mathcal D}(x)||_1} \propto \beta + c^{\mathcal D}(x) $$<br />
This is in fact a generalization of (uniform) label smoothing (label smoothing is a special case where <math>\beta = \frac 1 K \vec{1} </math>).<br />
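The smoothing connection can be checked numerically: with a symmetric prior and a single observed label, the posterior mean is exactly a uniformly smoothed one-hot target. (The implied λ below follows from a small algebra step of our own, not stated in the summary.)

```python
import numpy as np

K, b = 4, 1.0 / 4                 # symmetric prior beta = (1/K) * 1
beta = np.full(K, b)
c = np.eye(K)[2]                  # a single observed label, as a one-hot count

posterior_mean = (beta + c) / (beta + c).sum()

lam = K * b / (K * b + 1)         # implied smoothing strength (here 1/2)
smoothed = (1 - lam) * c + lam / K
print(posterior_mean, smoothed)   # identical vectors
```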
<br />
=== Representing Approximate Distribution ===<br />
Our new target distribution is <math>p_{z|x,y}(z)</math> (as opposed to <math>F_x(y)</math>). That is, we want to interpret our neural network's outputs as encoding a distribution with support on <math> \Delta^{K-1} </math> - the NN can then be trained so this encoded distribution closely approximates <math>p_{z|x,y}</math>. Let's denote the density of this encoded distribution <math>q_{z|x}^W</math>. The BM framework defines it as follows:<br />
$$ \alpha^W(x) := \exp(f^W(x)) $$<br />
$$ q_{z|x}^W(z) = \frac{\Gamma(||\alpha^W(x)||_1)}{\prod_{k=1}^K \Gamma(\alpha_k^W(x))} \prod_{k=1}^K z_{k}^{\alpha_k^W(x) - 1} $$<br />
$$ \to Z^W_x \sim dir(\alpha^W(x)) $$<br />
Rewriting <math>q_{z|x}^W</math> by taking the logarithm of each factor and then exponentiating the sum:<br />
$$ q^W_{z|x}(z) \propto \exp \left( \sum_k (\alpha_k^W(x) \log(z_k)) - \sum_k \log(z_k) \right) $$<br />
$$ \propto -l_{CE}(\phi(f^W(x)),z) + \frac{K}{||\alpha^W(x)||}KL(\mathcal U_k \;||\; z) $$<br />
It can actually be shown that the mean of <math>Z_x^W</math> is identical to <math>\phi(f^W(x))</math> - in other words, if we output the mean of the encoded distribution of our neural network under the BM framework, it is theoretically identical to a traditional neural network.<br />
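This last fact is easy to verify: the mean of <math>\text{Dir}(\alpha)</math> is <math>\alpha / \|\alpha\|_1</math>, which for <math>\alpha = \exp(f^W(x))</math> is exactly the softmax of the logits. A quick check with arbitrary toy logits:

```python
import numpy as np

rng = np.random.default_rng(1)
logits = np.array([1.2, -0.3, 0.5])
alpha = np.exp(logits)                        # alpha^W(x) = exp(f^W(x))

dirichlet_mean = alpha / alpha.sum()          # mean of Dir(alpha)
softmax = np.exp(logits) / np.exp(logits).sum()
print(np.allclose(dirichlet_mean, softmax))

# Monte Carlo confirmation of the same identity
samples = rng.dirichlet(alpha, size=100_000)
print(np.abs(samples.mean(axis=0) - softmax).max())  # small sampling error
```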
<br />
<br />
[[File: being_bayesian_about_categorical_probability_semi_supervised_equation.png]]<br />
<br />
The results of training on ResNet28-2 with consistency based loss functions on CIFAR-10 are shown in this table. Belief matching does have lower classification error rates compared to using a softmax.<br />
<br />
[[File:being_bayesian_about_categorical_probability_semi_supervised_table.png]]<br />
<br />
== Conclusion ==<br />
<br />
Bayesian principles can be used to construct the target distribution by using the categorical probability as a random variable rather than a training label. This can be applied to neural network models by replacing only the softmax and cross-entropy loss, while improving the generalization performance and uncertainty estimation. <br />
<br />
In the future, the authors would like to allow for more expressive distributions in the belief matching framework, such as logistic normal distributions to capture strong semantic similarities among class labels. Furthermore, using input dependent priors would allow for interesting properties that would aid imbalanced datasets and multi-domain learning.<br />
<br />
== Citations ==<br />
<br />
[1] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Springer, 1990.<br />
<br />
[2] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.<br />
<br />
[3] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.<br />
<br />
[4] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. <br />
<br />
[5] MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448– 472, 1992.<br />
<br />
[6] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011. <br />
<br />
[7] Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1):4873–4907, 2017.<br />
<br />
[8] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference of Machine Learning, 2018.<br />
<br />
[9] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[10] Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[11] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.<br />
<br />
[12] Neumann, L., Zisserman, A., and Vedaldi, A. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2018.<br />
<br />
[13] Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. Disturblabel: Regularizing cnn on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.<br />
<br />
[14] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_Probability&diff=49055Being Bayesian about Categorical Probability2020-12-04T01:01:33Z<p>Z42qin: /* Classification With a Neural Network */</p>
<hr />
<div>== Presented By ==<br />
Evan Li, Jason Pu, Karam Abuaisha, Nicholas Vadivelu<br />
<br />
== Introduction ==<br />
<br />
Since the outputs of neural networks are not probabilities, softmax (Bridle, 1990) is a staple for neural networks performing classification: it exponentiates each logit and then normalizes by the sum, giving a distribution over the target classes. However, networks with softmax outputs give no information about uncertainty (Blundell et al., 2015; Gal & Ghahramani, 2016), and the resulting distribution over classes is poorly calibrated (Guo et al., 2017), often giving overconfident predictions even when the classification is wrong. In addition, softmax raises concerns about overfitting neural networks due to its confident predictive behavior (Xie et al., 2016; Pereyra et al., 2017), so achieving better generalization performance may require effective regularization techniques. <br />
<br />
Bayesian Neural Networks (BNNs; MacKay, 1992) can alleviate these issues, but the resulting posteriors over the parameters are often intractable. Approximations such as variational inference (Graves, 2011; Blundell et al., 2015) and Monte Carlo Dropout (Gal & Ghahramani, 2016) can still be expensive or give poor estimates for the posteriors. This work proposes a Bayesian treatment of the output logits of the neural network, treating the targets as a categorical random variable instead of a fixed label. This gives a computationally cheap way to get well-calibrated uncertainty estimates on neural network classifications.<br />
<br />
== Related Work ==<br />
<br />
Using Bayesian Neural Networks is the dominant way of applying Bayesian techniques to neural networks. Many techniques have been developed to make posterior approximation more accurate and scalable; despite these, BNNs still do not scale to state-of-the-art architectures or large data sets. More scalable techniques explicitly avoid modeling the full weight posterior, such as Monte Carlo Dropout (Gal & Ghahramani, 2016) or tracking the mean and covariance of the posterior during training (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019). Non-Bayesian uncertainty estimation techniques include deep ensembles (Lakshminarayanan et al., 2017) and temperature scaling (Guo et al., 2017; Neumann et al., 2018).<br />
<br />
== Preliminaries ==<br />
=== Definitions ===<br />
Let's formalize our classification problem and define some notations for the rest of this summary:<br />
<br />
::Dataset:<br />
$$ \mathcal D = \{(x_i,y_i)\} \in (\mathcal X \times \mathcal Y)^N $$<br />
::General classification model<br />
$$ f^W: \mathcal X \to \mathbb R^K $$<br />
::Softmax function: <br />
$$ \phi: \mathbb R^K \to [0,1]^K \;\;|\;\; \phi_k(x) = \frac{\exp(f_k^W(x))}{\sum_{k'=1}^{K} \exp(f_{k'}^W(x))} $$<br />
::Softmax activated NN:<br />
$$ \phi \;\circ\; f^W: \mathcal X \to \Delta^{K-1} $$<br />
::NN as a true classifier:<br />
$$ \arg\max \;\circ\; \phi \;\circ\; f^W \;:\; \mathcal X \to \mathcal Y $$<br />
<br />
We'll also define the '''count function''' - a <math>K</math>-vector valued function that outputs the occurrences of each class coincident with <math>x</math> (treating each label <math>y'</math> as a one-hot vector):<br />
$$ c^{\mathcal D}(x) = \sum_{(x',y') \in \mathcal D} y' \, I(x' = x) $$<br />
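To make the notation concrete, here is a minimal NumPy sketch of the softmax map and the count function above (the function names and the toy dataset are ours, not from the paper):

```python
import numpy as np

def softmax(logits):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def count_vector(dataset, x, num_classes):
    """c^D(x): sums the one-hot labels of every example whose input equals x."""
    c = np.zeros(num_classes)
    for x_i, y_i in dataset:
        if np.array_equal(x_i, x):
            c[y_i] += 1
    return c

# Tiny dataset over K = 3 classes; the input 0 appears twice, both times with label 1.
data = [(0, 1), (0, 1), (1, 2)]
probs = softmax(np.array([2.0, 1.0, 0.1]))
counts = count_vector(data, 0, num_classes=3)
```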
<br />
=== Classification With a Neural Network ===<br />
A typical loss function used in classification is cross-entropy. It's well known that optimizing <math>f^W</math> for <math>l_{CE}</math> is equivalent to optimizing for <math>l_{KL}</math>, the <math>KL</math> divergence between the true distribution and the distribution modeled by NN, that is:<br />
$$ l_{KL}(W) = KL(\text{true distribution} \;||\; \text{distribution encoded by }NN(W)) $$<br />
Let's introduce notations for the underlying (true) distributions of our problem. Let <math>(x_0,y_0) \sim (\mathcal X \times \mathcal Y)</math>:<br />
$$ \text{Full Distribution} = F(x,y) = P(x_0 = x,y_0 = y) $$<br />
$$ \text{Marginal Distribution} = F(x) = P(x_0 = x) $$<br />
$$ \text{Point Class Distribution} = P(y_0 = y \;|\; x_0 = x) = F_x(y) $$<br />
Then we have the following factorization:<br />
$$ F(x,y) = P(x,y) = P(y|x)P(x) = F_x(y)F(x) $$<br />
Substituting this factorization into the KL divergence, and dropping terms that do not depend on <math>W</math>:<br />
$$ l_{KL}(W) = \sum_{(x,y) \in \mathcal X \times \mathcal Y} F(x,y) \log\left(\frac{F(y|x)}{\phi_y(f^W(x))}\right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F(y|x) \log\left( \frac{F(y|x)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F_x(y) \log\left( \frac{F_x(y)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) KL(F_x \;||\; \phi\left( f^W(x) \right)) $$<br />
As usual, we don't have an analytic form for <math>l_{KL}</math> (if we did, it would imply we knew <math>F_x</math>, i.e. the true distribution itself). Instead, we estimate it from <math>\mathcal D</math>:<br />
$$ F(x) \approx \hat F(x) = \frac{||c^{\mathcal D}(x)||_1}{N} $$<br />
$$ F_x(y) \approx \hat F_x(y) = \frac{c^{\mathcal D}(x)}{|| c^{\mathcal D}(x) ||_1}$$<br />
$$ \to l_{KL}(W) = \sum_{x \in \mathcal D} \frac{||c^{\mathcal D}(x)||_1}{N} KL \left( \frac{c^{\mathcal D}(x)}{||c^{\mathcal D}(x)||_1} \;||\; \phi(f^W(x)) \right) $$<br />
The approximations <math>\hat F, \hat F_x</math> are often not very good though: consider a typical classification task such as MNIST; we would never expect two handwritten digits to produce the exact same image. Hence <math>c^{\mathcal D}(x)</math> (almost) always has a single entry equal to 1 and the rest 0. This has implications for our approximations:<br />
$$ \hat F(x) \text{ is uniform for all } x \in \mathcal D $$<br />
$$ \hat F_x(y) \text{ is degenerate for all } x \in \mathcal D $$<br />
This clearly has implications for overfitting: to minimize the KL term in <math>l_{KL}(W)</math> we want <math>\phi(f^W(x))</math> to be very close to <math>\hat F_x(y)</math> at each point - this means that the loss function is in fact encouraging the neural network to output near degenerate distributions! <br />
<br />
'''Label Smoothing'''<br />
One form of regularization that addresses this problem is label smoothing. Instead of using the degenerate <math>\hat F_x(y)</math> as the target distribution, we "smooth" it (by mixing in a scaled uniform distribution) so that it is no longer degenerate:<br />
$$ F'_x(y) = (1-\lambda)\hat F_x(y) + \frac \lambda K \vec 1 $$<br />
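A minimal sketch of this smoothing step, assuming one-hot targets (the function name is ours):

```python
import numpy as np

def smooth_label(one_hot, lam):
    """Mix a (possibly degenerate) target with the uniform distribution over K classes."""
    K = len(one_hot)
    return (1.0 - lam) * one_hot + (lam / K) * np.ones(K)

# A degenerate 4-class target smoothed with lambda = 0.1
target = smooth_label(np.array([0.0, 1.0, 0.0, 0.0]), lam=0.1)
```

The result is still a valid probability vector, but every class now receives nonzero mass.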
<br />
'''BNNs'''<br />
BNNs balance the complexity of the model and the distance to the target distribution without committing to a single best configuration (a one-hot target). Specifically, a BNN with the Gaussian weight prior <math>W \sim \mathcal N(0, \tau^{-1} I)</math> scores a configuration <math>W</math> by the posterior density <math>p(W|\mathcal D) \propto p(\mathcal D|W)\,p(W)</math>, where <math>\log p(W) \propto -\tau ||W||_2^2</math>, so the prior acts as an <math>L_2</math> penalty on the weights.<br />
<br />
== Method ==<br />
The main technical proposal of the paper is a Bayesian framework to estimate the (former) target distribution <math>F_x(y)</math>. That is, we construct a posterior distribution of <math> F_x(y) </math> and use that as our new target distribution. We call it the ''belief matching'' (BM) framework.<br />
<br />
=== Constructing Target Distribution ===<br />
Recall that <math>F_x(y)</math> is a <math>K</math>-categorical probability distribution - its PMF can be fully characterized by <math>K</math> numbers that sum to 1. Hence we can encode any such <math>F_x</math> as a point in <math>\Delta^{K-1}</math>. We'll do exactly that - let's call this vector <math>z</math>:<br />
$$ z \in \Delta^{k-1} $$<br />
$$ \text{prior} = p_{z|x}(z) $$<br />
$$ \text{conditional} = p_{y|z,x}(y) $$<br />
$$ \text{posterior} = p_{z|x,y}(z) $$<br />
Then if we perform inference:<br />
$$ p_{z|x,y}(z) \propto p_{z|x}(z)p_{y|z,x}(y) $$<br />
The distribution chosen to model the prior was the Dirichlet distribution <math>Dir_K(\beta)</math>:<br />
$$ p_{z|x}(z) = \frac{\Gamma(||\beta||_1)}{\prod_{k=1}^K \Gamma(\beta_k)} \prod_{k=1}^K z_k^{\beta_k - 1} $$<br />
Note that by definition of <math>z</math>: <math> p_{y|x,z} = z_y </math>. Since the Dirichlet is a conjugate prior to categorical distributions we have a convenient form for the mean of the posterior:<br />
$$ \bar{p_{z|x,y}}(z) = \frac{\beta + c^{\mathcal D}(x)}{||\beta + c^{\mathcal D}(x)||_1} \propto \beta + c^{\mathcal D}(x) $$<br />
This is in fact a generalization of (uniform) label smoothing (label smoothing is a special case where <math>\beta = \frac 1 K \vec{1} </math>).<br />
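A short numerical check of this claim (a toy example of our own): with a uniform concentration <math>\beta = \frac 1 K \vec 1</math> (so <math>\beta_0 = 1</math>) and a single observed label, the posterior mean coincides with label smoothing at <math>\lambda = \beta_0 / (\beta_0 + 1) = 0.5</math>:

```python
import numpy as np

def posterior_mean(beta, counts):
    """Mean of the Dirichlet posterior: (beta + c^D(x)) / ||beta + c^D(x)||_1."""
    post = beta + counts
    return post / post.sum()

K = 4
beta = np.full(K, 1.0 / K)                # uniform concentration, beta_0 = 1
counts = np.array([0.0, 1.0, 0.0, 0.0])   # a single observation of class 1
mean = posterior_mean(beta, counts)

# Label smoothing of the same one-hot target with lambda = 0.5
smoothed = (1 - 0.5) * counts + (0.5 / K) * np.ones(K)
```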
<br />
=== Representing Approximate Distribution ===<br />
Our new target distribution is <math>p_{z|x,y}(z)</math> (as opposed to <math>F_x(y)</math>). That is, we want to construct an interpretation of our neural network weights to construct a distribution with support in <math> \Delta^{K-1} </math> - the NN can then be trained so this encoded distribution closely approximates <math>p_{z|x,y}</math>. Let's denote the PMF of this encoded distribution <math>q_{z|x}^W</math>. This is how the BM framework defines it:<br />
$$ \alpha^W(x) := \exp(f^W(x)) $$<br />
$$ q_{z|x}^W(z) = \frac{\Gamma(||\alpha^W(x)||_1)}{\prod_{k=1}^K \Gamma(\alpha_k^W(x))} \prod_{k=1}^K z_{k}^{\alpha_k^W(x) - 1} $$<br />
$$ \to Z^W_x \sim dir(\alpha^W(x)) $$<br />
Apply <math>\log</math> then <math>\exp</math> to <math>q_{z|x}^W</math>:<br />
$$ q^W_{z|x}(z) \propto \exp \left( \sum_k (\alpha_k^W(x) \log(z_k)) - \sum_k \log(z_k) \right) $$<br />
$$ \propto -l_{CE}(\phi(f^W(x)),z) + \frac{K}{||\alpha^W(x)||}KL(\mathcal U_k \;||\; z) $$<br />
It can actually be shown that the mean of <math>Z_x^W</math> is identical to <math>\phi(f^W(x))</math> - in other words, if we output the mean of the encoded distribution of our neural network under the BM framework, it is theoretically identical to a traditional neural network.<br />
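This identity is easy to verify numerically (a small sketch of our own, using NumPy's Dirichlet sampler both analytically and by Monte Carlo):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.5, -0.3, 0.7])

alpha = np.exp(logits)                # alpha^W(x) = exp(f^W(x))
dirichlet_mean = alpha / alpha.sum()  # analytic mean of Dir(alpha)

softmax_out = np.exp(logits) / np.exp(logits).sum()

# Monte Carlo check of the same identity
mc_mean = rng.dirichlet(alpha, size=200_000).mean(axis=0)
```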
<br />
=== Distribution Matching ===<br />
<br />
We now need a way to fit our approximate distribution from our neural network <math>q_{\mathbf{z | x}}^{\mathbf{W}}</math> to our target distribution <math>p_{\mathbf{z|x},y}</math>. The authors achieve this by maximizing the evidence lower bound (ELBO):<br />
<br />
$$l_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) $$<br />
<br />
Each term can be computed analytically:<br />
<br />
$$\mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf W }} \left[\log z_y \right] = \psi(\alpha_y^{\mathbf W} ( \mathbf x )) - \psi(\alpha_0^{\mathbf W} ( \mathbf x )) $$<br />
<br />
Where <math>\psi(\cdot)</math> represents the digamma function (logarithmic derivative of gamma function). Intuitively, we maximize the probability of the correct label. For the KL term:<br />
<br />
$$KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) = \log \frac{\Gamma(\alpha_0^{\mathbf W}(\mathbf x)) \prod_k \Gamma(\beta_k)}{\prod_k \Gamma(\alpha_k^{\mathbf W}(\mathbf x)) \Gamma (\beta_0)} + \sum_k (\alpha_k^{\mathbf W}(\mathbf x)-\beta_k)\left(\psi(\alpha_k^{\mathbf W}(\mathbf x)) - \psi(\alpha_0^{\mathbf W}(\mathbf x))\right) $$<br />
<br />
In the first term, for intuition, we can ignore <math>\alpha_0</math> and <math>\beta_0</math> since those just calibrate the distributions. Otherwise, we want the ratio of the products to be as close to 1 as possible to minimize the KL. In the second term, we want to minimize the difference between each individual <math>\alpha_k</math> and <math>\beta_k</math>, scaled by the normalized output of the neural network. <br />
<br />
This loss function can be used as a drop-in replacement for the standard softmax cross-entropy, as it has an analytic form and the same time complexity as typical softmax-cross entropy with respect to the number of classes (<math>O(K)</math>).<br />
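A sketch of this drop-in loss for a single example, combining the two analytic terms above (the function name is ours; we assume SciPy for the digamma and log-gamma functions):

```python
import numpy as np
from scipy.special import digamma, gammaln

def belief_matching_loss(logits, y, beta):
    """Negative ELBO l_EB for one example: -(E_q[log p(y|x,z)] - KL(Dir(alpha) || Dir(beta)))."""
    alpha = np.exp(logits)
    alpha0, beta0 = alpha.sum(), beta.sum()
    # E_q[log z_y] = psi(alpha_y) - psi(alpha_0)
    expected_ll = digamma(alpha[y]) - digamma(alpha0)
    # Closed-form KL between two Dirichlet distributions
    kl = (gammaln(alpha0) - gammaln(beta0)
          + np.sum(gammaln(beta) - gammaln(alpha))
          + np.sum((alpha - beta) * (digamma(alpha) - digamma(alpha0))))
    return -(expected_ll - kl)

beta = np.ones(3)
# With logits = 0, alpha == beta, so the KL term vanishes and the
# loss reduces to psi(beta_0) - psi(beta_y) = psi(3) - psi(1) = 1.5.
loss_at_prior = belief_matching_loss(np.zeros(3), 0, beta)
```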
<br />
=== On Prior Distributions ===<br />
<br />
We must choose our concentration parameter, <math>\beta</math>, for our Dirichlet prior. We see our prior essentially disappears as <math>\beta_0 \to 0</math> and becomes stronger as <math>\beta_0 \to \infty</math>. Thus, we want a small <math>\beta_0</math> so the posterior isn't dominated by the prior. However, the authors claim that a small <math>\beta_0</math> makes <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small, which causes <math>\psi (\alpha_0^{\mathbf W}(\mathbf x))</math> to be large, which is problematic for gradient based optimization. In practice, many neural network techniques aim to make <math>\mathbb E [f^{\mathbf W} (\mathbf x)] \approx \mathbf 0</math> and thus <math>\mathbb E [\alpha^{\mathbf W} (\mathbf x)] \approx \mathbf 1</math>, which means making <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small can be counterproductive.<br />
<br />
So, the authors set <math>\beta = \mathbf 1</math> and introduce a new hyperparameter <math>\lambda</math> which is multiplied with the KL term in the ELBO:<br />
<br />
$$l^\lambda_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - \lambda KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; \mathcal P^D (\mathbf 1)) $$<br />
<br />
This stabilizes the optimization, as we can tell from the gradients:<br />
<br />
$$\frac{\partial l_{E B}\left(\mathbf{y}, \alpha^{\mathbf W}(\mathbf{x})\right)}{\partial \alpha_{k}^{\mathbf W}(\mathbf {x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\alpha_{k}^{\mathbf W}(\mathbf{x})-\beta_{k}\right)\right) \psi^{\prime}\left(\alpha_{k}^{\mathbf{W}}(\boldsymbol{x})\right)<br />
-\left(1-\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})-\beta_{0}\right)\right) \psi^{\prime}\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})\right)$$<br />
<br />
$$\frac{\partial l_{E B}^{\lambda}\left(\mathbf{y}, \alpha^{\mathbf{W}}(\mathbf{x})\right)}{\partial \alpha_{k}^{W}(\mathbf{x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})-\lambda\right)\right) \frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}<br />
-\left(1-\left(\tilde{\alpha}_{0}^{W}(\mathbf{x})-\lambda K\right)\right)$$<br />
<br />
As we can see, the first expression is affected by the magnitude of <math>\alpha^{\boldsymbol{W}}(\boldsymbol{x})</math>, whereas the second expression is not due to the <math>\frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}</math> ratio.<br />
<br />
== Experiments ==<br />
<br />
Throughout the experiments in this paper, the authors employ various models based on residual connections (He et al., 2016), which are standard benchmark models in practice. They first demonstrate the improvements provided by BM, then show its versatility in other applications. For fairness of comparison, all configurations in the reference implementation were fixed; the only additions were an initial learning-rate warm-up and gradient clipping, which are extremely helpful for stable training of BM. <br />
<br />
=== Generalization performance === <br />
The paper compares the generalization performance of BM with softmax and MC dropout on CIFAR-10 and CIFAR-100 benchmarks.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T1.png]]<br />
<br />
The next comparison was performed between BM and softmax on the ImageNet benchmark. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T2.png]]<br />
<br />
In all configurations on both datasets, BM achieves the best generalization, outperforming both softmax and MC dropout.<br />
<br />
===== Regularization effect of prior =====<br />
<br />
In theory, BM has two regularization effects:<br />
::1. The prior distribution, which smooths the target posterior.<br />
::2. Averaging over all possible categorical probabilities when computing the distribution-matching loss.<br />
The authors perform an ablation study to examine the two effects separately - removing the KL term in the ELBO removes the effect of the prior distribution.<br />
For ResNet-50, the resulting test error rates were 24.69% on CIFAR-100 and 5.68% on CIFAR-10.<br />
<br />
This demonstrates that both regularization effects are significant since just having one of them improves the generalization performance compared to the softmax baseline, and having both improves the performance even more.<br />
<br />
===== Impact of <math>\beta</math> =====<br />
<br />
The effect of β on generalization performance is studied by training ResNet-18 on CIFAR-10 by tuning the value of β on its own, as well as jointly with λ. It was found that robust generalization performance is obtained for β ∈ [<math>e^{−1}, e^4</math>] when tuning β on its own; and β ∈ [<math>e^{−4}, e^{8}</math>] when tuning β jointly with λ. The figure below shows a plot of the error rate with varying β.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F3.png]]<br />
<br />
=== Uncertainty Representation ===<br />
<br />
One of the big advantages of BM is the ability to represent uncertainty about the prediction. The authors evaluate the uncertainty representation on in-distribution (ID) and out-of-distribution (OOD) samples. <br />
<br />
===== ID uncertainty =====<br />
<br />
For ID (in-distribution) samples, calibration performance is measured, which is a measure of how well the model’s confidence matches its actual accuracy. This measure can be visualized using reliability plots and quantified using a metric called expected calibration error (ECE). ECE is calculated by grouping predictions into M groups based on their confidence score and then finding the absolute difference between the average accuracy and average confidence for each group.<br />
The figure below is a reliability plot of ResNet-50 on CIFAR-10 and CIFAR-100 with 15 groups. It shows that BM has a significantly better calibration performance than softmax since the confidence matches the accuracy more closely (this is also reflected in the lower ECE).<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F4.png]]<br />
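For illustration, here is a minimal ECE computation following the grouping scheme described above (a sketch of our own; the toy inputs are made up):

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_groups=15):
    """ECE: bin predictions by confidence, then sum |accuracy - confidence| per bin, weighted by bin size."""
    edges = np.linspace(0.0, 1.0, num_groups + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += (in_bin.sum() / n) * gap
    return ece

# Perfectly calibrated toy predictions: confidence 0.75, accuracy 6/8 = 0.75
calibrated = expected_calibration_error(np.full(8, 0.75), np.array([1, 1, 1, 0, 1, 1, 1, 0]))
# Overconfident toy predictions: confidence 0.9, accuracy 0
overconfident = expected_calibration_error(np.full(4, 0.9), np.zeros(4))
```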
<br />
===== OOD uncertainty =====<br />
<br />
Here, the authors quantify uncertainty using predictive entropy - the larger the predictive entropy, the larger the uncertainty about a prediction. <br />
<br />
The figure below is a density plot of the predictive entropy of ResNet-50 on CIFAR-10. It shows that BM provides significantly better uncertainty estimation compared to other methods since BM is the only method that has a clear peak of high predictive entropy for OOD samples which should have high uncertainty. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F5.png]]<br />
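The uncertainty measure itself is just the Shannon entropy of the predictive distribution; a minimal sketch (our own, with made-up predictive vectors):

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical predictive distribution."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=-1)

confident_id = predictive_entropy(np.array([0.98, 0.01, 0.01]))  # ID-like: low uncertainty
uniform_ood = predictive_entropy(np.array([1, 1, 1]) / 3)        # OOD-like: maximal uncertainty, log(K)
```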
<br />
=== Transfer learning ===<br />
<br />
Belief matching applies the Bayesian principle outside the neural network, which means it can easily be applied to already trained models. Thus, belief matching can be employed in transfer learning scenarios. The authors downloaded the ImageNet pre-trained ResNet-50 weights and fine-tuned the weights of the last linear layer for 100 epochs using an Adam optimizer.<br />
<br />
This table shows the test error rates from transfer learning on CIFAR-10, Food-101, and Cars datasets. Belief matching consistently performs better than softmax. <br />
<br />
[[File:being_bayesian_about_categorical_probability_transfer_learning.png]]<br />
<br />
Belief matching was also tested on predictive uncertainty for out-of-distribution samples, with CIFAR-10 as the in-distribution dataset. Looking at the figure below, belief matching significantly improves the uncertainty representation of pre-trained models by fine-tuning only the last layer's weights. Note that belief matching still predicts confidently on examples from Cars, since CIFAR-10 contains the object category automobile. In comparison, softmax produces confident predictions on all datasets. Thus, belief matching can enhance the uncertainty representation of pre-trained models without sacrificing their generalization performance.<br />
<br />
[[File: being_bayesian_about_categorical_probability_transfer_learning_uncertainty.png]]<br />
<br />
=== Semi-Supervised Learning ===<br />
<br />
Belief matching’s ability to let neural networks represent rich information in their predictions can be exploited to aid consistency-based loss functions for semi-supervised learning. Consistency-based loss functions use unlabelled samples to promote robustness of predictions under stochastic perturbations. This can be done by perturbing the inputs (the VAT model) or the networks (the Π-model). Both methods minimize the divergence between two categorical probabilities under some perturbation, so belief matching can be applied via the following replacements in the loss functions. The hope is that belief matching provides better prediction consistency through its Dirichlet distributions.<br />
<br />
[[File: being_bayesian_about_categorical_probability_semi_supervised_equation.png]]<br />
<br />
The results of training ResNet28-2 with consistency-based loss functions on CIFAR-10 are shown in the table below. Belief matching achieves lower classification error rates than softmax.<br />
<br />
[[File:being_bayesian_about_categorical_probability_semi_supervised_table.png]]<br />
<br />
== Conclusion ==<br />
<br />
Bayesian principles can be used to construct the target distribution by using the categorical probability as a random variable rather than a training label. This can be applied to neural network models by replacing only the softmax and cross-entropy loss, while improving the generalization performance and uncertainty estimation. <br />
<br />
In the future, the authors would like to allow for more expressive distributions in the belief matching framework, such as logistic normal distributions to capture strong semantic similarities among class labels. Furthermore, using input dependent priors would allow for interesting properties that would aid imbalanced datasets and multi-domain learning.<br />
<br />
== Citations ==<br />
<br />
[1] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Springer, 1990.<br />
<br />
[2] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.<br />
<br />
[3] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.<br />
<br />
[4] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. <br />
<br />
[5] MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448– 472, 1992.<br />
<br />
[6] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011. <br />
<br />
[7] Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1):4873–4907, 2017.<br />
<br />
[8] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference of Machine Learning, 2018.<br />
<br />
[9] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[10] Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[11] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.<br />
<br />
[12] Neumann, L., Zisserman, A., and Vedaldi, A. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2018.<br />
<br />
[13] Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. Disturblabel: Regularizing cnn on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.<br />
<br />
[14] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_Probability&diff=49054Being Bayesian about Categorical Probability2020-12-04T00:59:20Z<p>Z42qin: /* Classification With a Neural Network */</p>
<hr />
<div>== Presented By ==<br />
Evan Li, Jason Pu, Karam Abuaisha, Nicholas Vadivelu<br />
<br />
== Introduction ==<br />
<br />
Since the outputs of neural networks are not probabilities, Softmax (Bridle, 1990) is a staple for neural network’s performing classification--it exponentiates each logit then normalizes by the sum, giving a distribution over the target classes. However, networks with softmax outputs give no information about uncertainty (Blundell et al., 2015; Gal & Ghahramani, 2016), and the resulting distribution over classes is poorly calibrated (Guo et al., 2017), often giving overconfident predictions even when the classification is wrong. In addition, softmax also raises concerns about overfitting NNs due to its confident predictive behaviors(Xie et al., 2016; Pereyra et al., 2017). To achieve better generalization performance, this may require some effective regularization techniques. <br />
<br />
Bayesian Neural Networks (BNNs; MacKay, 1992) can alleviate these issues, but the resulting posteriors over the parameters are often intractable. Approximations such as variational inference (Graves, 2011; Blundell et al., 2015) and Monte Carlo Dropout (Gal & Ghahramani, 2016) can still be expensive or give poor estimates for the posteriors. This work proposes a Bayesian treatment of the output logits of the neural network, treating the targets as a categorical random variable instead of a fixed label. This gives a computationally cheap way to get well-calibrated uncertainty estimates on neural network classifications.<br />
<br />
== Related Work ==<br />
<br />
Using Bayesian Neural Networks is the dominant way of applying Bayesian techniques to neural networks. Many techniques have been developed to make posterior approximation more accurate and scalable, but despite these, BNNs do not scale to state-of-the-art architectures or large data sets. More scalable techniques explicitly avoid modeling the full weight posterior, such as Monte Carlo Dropout (Gal & Ghahramani, 2016) or tracking the mean/covariance of the posterior during training (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019). Non-Bayesian uncertainty estimation techniques include deep ensembles (Lakshminarayanan et al., 2017) and temperature scaling (Guo et al., 2017; Neumann et al., 2018).<br />
<br />
== Preliminaries ==<br />
=== Definitions ===<br />
Let's formalize our classification problem and define some notations for the rest of this summary:<br />
<br />
::Dataset:<br />
$$ \mathcal D = \{(x_i,y_i)\} \in (\mathcal X \times \mathcal Y)^N $$<br />
::General classification model<br />
$$ f^W: \mathcal X \to \mathbb R^K $$<br />
::Softmax function: <br />
$$ \phi: \mathbb R^K \to [0,1]^K \;\;|\;\; \phi_k(x) = \frac{\exp(f_k^W(x))}{\sum_{j=1}^K \exp(f_j^W(x))} $$<br />
::Softmax activated NN:<br />
$$ \phi \;\circ\; f^W: \mathcal X \to \Delta^{K-1} $$<br />
::NN as a true classifier:<br />
$$ \arg\max_i \;\circ\; \phi_i \;\circ\; f^W \;:\; \mathcal X \to \mathcal Y $$<br />
<br />
We'll also define the '''count function''' - a <math>K</math>-vector valued function that outputs the occurrences of each class coincident with <math>x</math> (here <math>y'</math> is treated as a one-hot vector):<br />
$$ c^{\mathcal D}(x) = \sum_{(x',y') \in \mathcal D} y' \, I(x' = x) $$<br />
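The definitions above can be illustrated numerically. This is a minimal NumPy sketch (the toy dataset, class count, and logits are assumptions for illustration only):<br />

```python
import numpy as np

K = 3  # hypothetical number of classes

def softmax(logits):
    """phi: R^K -> Delta^{K-1}; subtract the max for numerical stability."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def count_function(D, x, K):
    """c^D(x): K-vector counting class occurrences among points equal to x."""
    c = np.zeros(K)
    for x_prime, y_prime in D:
        if x_prime == x:
            c[y_prime] += 1  # y' as a class index; equivalently, add its one-hot vector
    return c

D = [("img_a", 0), ("img_a", 0), ("img_b", 2)]  # toy dataset
print(count_function(D, "img_a", K))                 # [2. 0. 0.]
print(softmax(np.array([2.0, 1.0, 0.1])).argmax())   # 0 (predicted class)
```

The composition of `softmax` with `argmax` is the "NN as a true classifier" map defined above.<br />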
<br />
=== Classification With a Neural Network ===<br />
A typical loss function used in classification is cross-entropy. It's well known that optimizing <math>f^W</math> for <math>l_{CE}</math> is equivalent to optimizing for <math>l_{KL}</math>, the <math>KL</math> divergence between the true distribution and the distribution modeled by NN, that is:<br />
$$ l_{KL}(W) = KL(\text{true distribution} \;|\; \text{distribution encoded by }NN(W)) $$<br />
Let's introduce notations for the underlying (true) distributions of our problem. Let <math>(x_0,y_0) \sim (\mathcal X \times \mathcal Y)</math>:<br />
$$ \text{Full distribution:} \quad F(x,y) = P(x_0 = x,\, y_0 = y) $$<br />
$$ \text{Marginal distribution:} \quad F(x) = P(x_0 = x) $$<br />
$$ \text{Point class distribution:} \quad F_x(y) = P(y_0 = y \mid x_0 = x) $$<br />
Then we have the following factorization:<br />
$$ F(x,y) = P(x,y) = P(y|x)P(x) = F_x(y)F(x) $$<br />
Substituting this into the KL divergence between the true conditional distributions and the model, averaged over inputs:<br />
$$ l_{KL}(W) = \sum_{(x,y) \in \mathcal X \times \mathcal Y} F(x,y) \log\left(\frac{F_x(y)}{\phi_y(f^W(x))}\right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F_x(y) \log\left( \frac{F_x(y)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \, KL(F_x \;||\; \phi\left( f^W(x) \right)) $$<br />
As usual, we don't have an analytic form for <math>l_{KL}</math> (if we did, it would imply we knew <math>F_x</math>, i.e. the true distribution, in the first place). Instead, we estimate it from <math>\mathcal D</math>:<br />
$$ F(x) \approx \hat F(x) = \frac{||c^{\mathcal D}(x)||_1}{N} $$<br />
$$ F_x(y) \approx \hat F_x(y) = \frac{c^{\mathcal D}(x)}{|| c^{\mathcal D}(x) ||_1}$$<br />
$$ \to l_{KL}(W) = \sum_{x \in \mathcal D} \frac{||c^{\mathcal D}(x)||_1}{N} KL \left( \frac{c^{\mathcal D}(x)}{||c^{\mathcal D}(x)||_1} \;||\; \phi(f^W(x)) \right) $$<br />
The approximations <math>\hat F, \hat F_x</math> are often not very good, though: consider a typical classification task such as MNIST - we would never expect two handwritten digits to produce the exact same image. Hence <math>c^{\mathcal D}(x)</math> almost always has a single entry equal to 1 and the rest 0. This has implications for our approximations:<br />
$$ \hat F(x) \text{ is uniform for all } x \in \mathcal D $$<br />
$$ \hat F_x(y) \text{ is degenerate for all } x \in \mathcal D $$<br />
This clearly has implications for overfitting: to minimize the KL term in <math>l_{KL}(W)</math> we want <math>\phi(f^W(x))</math> to be very close to <math>\hat F_x(y)</math> at each point - this means that the loss function is in fact encouraging the neural network to output near degenerate distributions! <br />
<br />
'''Label Smoothing'''<br />
One form of regularization to address this problem is label smoothing. Instead of using the degenerate <math>\hat F_x(y)</math> as a target distribution, we "smooth" it (by mixing in a scaled uniform distribution) so it's no longer degenerate:<br />
$$ F'_x(y) = (1-\lambda)\hat F_x(y) + \frac \lambda K \vec 1 $$<br />
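The smoothing step above can be sketched in a few lines (the value of `lam` here is an arbitrary illustrative choice of <math>\lambda</math>):<br />

```python
import numpy as np

def smooth_target(one_hot, lam):
    """Mix the degenerate one-hot target with a uniform distribution over K classes."""
    K = len(one_hot)
    return (1.0 - lam) * one_hot + (lam / K) * np.ones(K)

target = np.array([0.0, 1.0, 0.0])   # degenerate \hat F_x
smoothed = smooth_target(target, lam=0.1)
print(smoothed)        # ~[0.0333 0.9333 0.0333]
print(smoothed.sum())  # still a valid distribution (sums to 1 up to float error)
```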
<br />
'''BNNs'''<br />
BNNs balance the complexity of the model against the distance to the target distribution without choosing a single best weight configuration. Specifically, a BNN with the Gaussian weight prior <math> W \sim \mathcal N(0, \tau^{-1} I) </math> scores a configuration <math>W</math> by the posterior density <math>p(W \mid \mathcal D)</math>.<br />
<br />
== Method ==<br />
The main technical proposal of the paper is a Bayesian framework to estimate the (former) target distribution <math>F_x(y)</math>. That is, we construct a posterior distribution of <math> F_x(y) </math> and use that as our new target distribution. We call it the ''belief matching'' (BM) framework.<br />
<br />
=== Constructing Target Distribution ===<br />
Recall that <math>F_x</math> is a <math>K</math>-categorical probability distribution - its PMF can be fully characterized by <math>K</math> numbers that sum to 1. Hence we can encode any such <math>F_x</math> as a point in <math>\Delta^{K-1}</math>. We'll do exactly that - let's call this vector <math>z</math>:<br />
$$ z \in \Delta^{K-1} $$<br />
$$ \text{prior} = p_{z|x}(z) $$<br />
$$ \text{conditional} = p_{y|z,x}(y) $$<br />
$$ \text{posterior} = p_{z|x,y}(z) $$<br />
Then if we perform inference:<br />
$$ p_{z|x,y}(z) \propto p_{z|x}(z)p_{y|z,x}(y) $$<br />
The distribution chosen to model the prior was the Dirichlet distribution <math>dir_K(\beta)</math>:<br />
$$ p_{z|x}(z) = \frac{\Gamma(||\beta||_1)}{\prod_{k=1}^K \Gamma(\beta_k)} \prod_{k=1}^K z_k^{\beta_k - 1} $$<br />
Note that by definition of <math>z</math>: <math> p_{y|x,z} = z_y </math>. Since the Dirichlet is a conjugate prior to categorical distributions we have a convenient form for the mean of the posterior:<br />
$$ \bar{p_{z|x,y}}(z) = \frac{\beta + c^{\mathcal D}(x)}{||\beta + c^{\mathcal D}(x)||_1} \propto \beta + c^{\mathcal D}(x) $$<br />
This is in fact a generalization of (uniform) label smoothing (label smoothing is a special case where <math>\beta = \frac 1 K \vec{1} </math>).<br />
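As a quick numeric check of this claim (a sketch: <math>K = 3</math>, a one-hot count, and <math>\beta = \frac 1 K \vec 1</math> are illustrative assumptions), the Dirichlet posterior mean reproduces uniform label smoothing - for a single observed label this works out to an effective <math>\lambda = 1/2</math>:<br />

```python
import numpy as np

K = 3
beta = np.ones(K) / K           # prior concentration (1/K) * 1
c = np.array([0.0, 1.0, 0.0])   # one-hot count c^D(x): one observation of class 1

# Conjugate update: posterior mean is (beta + c) / ||beta + c||_1
posterior_mean = (beta + c) / (beta + c).sum()

lam = 0.5                       # effective smoothing strength in this case
smoothed = (1 - lam) * c + lam / K
print(np.allclose(posterior_mean, smoothed))  # True
```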
<br />
=== Representing Approximate Distribution ===<br />
Our new target distribution is <math>p_{z|x,y}(z)</math> (as opposed to <math>F_x(y)</math>). That is, we want to interpret the neural network's outputs as a distribution with support in <math> \Delta^{K-1} </math> - the NN can then be trained so this encoded distribution closely approximates <math>p_{z|x,y}</math>. Let's denote the density of this encoded distribution <math>q_{z|x}^W</math>. This is how the BM framework defines it:<br />
$$ \alpha^W(x) := \exp(f^W(x)) $$<br />
$$ q_{z|x}^W(z) = \frac{\Gamma(||\alpha^W(x)||_1)}{\prod_{k=1}^K \Gamma(\alpha_k^W(x))} \prod_{k=1}^K z_{k}^{\alpha_k^W(x) - 1} $$<br />
$$ \to Z^W_x \sim dir(\alpha^W(x)) $$<br />
Writing <math>q_{z|x}^W</math> as the exponential of its logarithm:<br />
$$ q^W_{z|x}(z) \propto \exp \left( \sum_k (\alpha_k^W(x) \log(z_k)) - \sum_k \log(z_k) \right) $$<br />
$$ \propto -l_{CE}(\phi(f^W(x)),z) + \frac{K}{||\alpha^W(x)||}KL(\mathcal U_k \;||\; z) $$<br />
It can actually be shown that the mean of <math>Z_x^W</math> is identical to <math>\phi(f^W(x))</math> - in other words, if we output the mean of the encoded distribution of our neural network under the BM framework, it is theoretically identical to a traditional neural network.<br />
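This identity is easy to verify numerically (a sketch with arbitrary, hypothetical logits):<br />

```python
import numpy as np

f = np.array([1.5, -0.3, 0.7])        # hypothetical logits f^W(x)
alpha = np.exp(f)                      # Dirichlet concentration alpha^W(x)
dirichlet_mean = alpha / alpha.sum()   # mean of Dir(alpha) is alpha / ||alpha||_1

e = np.exp(f - f.max())                # softmax(f), max-shifted for stability
softmax_out = e / e.sum()
print(np.allclose(dirichlet_mean, softmax_out))  # True
```

The agreement holds because the max-shift cancels in the ratio, so <math>\exp(f)/\sum_j \exp(f_j)</math> is exactly the Dirichlet mean.<br />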
<br />
=== Distribution Matching ===<br />
<br />
We now need a way to fit our approximate distribution from our neural network <math>q_{\mathbf{z | x}}^{\mathbf{W}}</math> to our target distribution <math>p_{\mathbf{z|x},y}</math>. The authors achieve this by maximizing the evidence lower bound (ELBO):<br />
<br />
$$l_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) $$<br />
<br />
Each term can be computed analytically:<br />
<br />
$$\mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf W }} \left[\log z_y \right] = \psi(\alpha_y^{\mathbf W} ( \mathbf x )) - \psi(\alpha_0^{\mathbf W} ( \mathbf x )) $$<br />
<br />
Where <math>\psi(\cdot)</math> represents the digamma function (logarithmic derivative of gamma function). Intuitively, we maximize the probability of the correct label. For the KL term:<br />
<br />
$$KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) = \log \frac{\Gamma(\alpha_0^{\mathbf W}(\mathbf x)) \prod_k \Gamma(\beta_k)}{\prod_k \Gamma(\alpha_k^{\mathbf W}(\mathbf x)) \, \Gamma (\beta_0)} + \sum_k (\alpha_k^{\mathbf W}(\mathbf x)-\beta_k)(\psi(\alpha_k^{\mathbf W}(\mathbf x)) - \psi(\alpha_0^{\mathbf W}(\mathbf x))) $$<br />
<br />
In the first term, for intuition, we can ignore <math>\alpha_0</math> and <math>\beta_0</math> since those just calibrate the distributions. Otherwise, we want the ratio of the products to be as close to 1 as possible to minimize the KL. In the second term, we want to minimize the difference between each individual <math>\alpha_k</math> and <math>\beta_k</math>, scaled by the normalized output of the neural network. <br />
<br />
This loss function can be used as a drop-in replacement for the standard softmax cross-entropy, as it has an analytic form and the same time complexity as typical softmax-cross entropy with respect to the number of classes (<math>O(K)</math>).<br />
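A minimal sketch of this objective (not the authors' implementation; it assumes the <math>\beta = \vec 1</math> prior introduced below, a single example, and uses SciPy's digamma/log-gamma functions):<br />

```python
import numpy as np
from scipy.special import digamma, gammaln

def neg_elbo(logits, y, lam=1.0):
    """-l_EB: negative of E_q[log z_y] - lam * KL(Dir(alpha) || Dir(1)),
    a drop-in replacement for softmax cross-entropy on one example."""
    alpha = np.exp(logits)           # alpha^W(x)
    a0 = alpha.sum()                 # alpha_0^W(x)
    K = len(alpha)
    expected_log_lik = digamma(alpha[y]) - digamma(a0)
    # KL(Dir(alpha) || Dir(1)): Gamma(beta_k) = 1 and beta_0 = K
    kl = (gammaln(a0) - gammaln(alpha).sum() - gammaln(K)
          + ((alpha - 1.0) * (digamma(alpha) - digamma(a0))).sum())
    return -(expected_log_lik - lam * kl)

loss = neg_elbo(np.array([2.0, 0.5, -1.0]), y=0)
print(loss)  # a finite scalar, O(K) to compute
```

With zero logits, <math>\alpha = \vec 1</math> and the KL term vanishes, so the loss reduces to <math>-(\psi(1) - \psi(K))</math>.<br />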
<br />
=== On Prior Distributions ===<br />
<br />
We must choose the concentration parameter <math>\beta</math> for our Dirichlet prior. The prior essentially disappears as <math>\beta_0 \to 0</math> and becomes stronger as <math>\beta_0 \to \infty</math>. Thus, we want a small <math>\beta_0</math> so the posterior isn't dominated by the prior. However, the authors claim that a small <math>\beta_0</math> makes <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small, which causes <math>\psi (\alpha_0^{\mathbf W}(\mathbf x))</math> to be large, which is problematic for gradient-based optimization. In practice, many neural network techniques aim to make <math>\mathbb E [f^{\mathbf W} (\mathbf x)] \approx \mathbf 0</math> and thus <math>\mathbb E [\alpha^{\mathbf W} (\mathbf x)] \approx \mathbf 1</math>, so making <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small can be counterproductive.<br />
<br />
So, the authors set <math>\beta = \mathbf 1</math> and introduce a new hyperparameter <math>\lambda</math> which is multiplied with the KL term in the ELBO:<br />
<br />
$$l^\lambda_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - \lambda KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; \mathcal P^D (\mathbf 1)) $$<br />
<br />
This stabilizes the optimization, as we can tell from the gradients:<br />
<br />
$$\frac{\partial l_{E B}\left(\mathbf{y}, \alpha^{\mathbf W}(\mathbf{x})\right)}{\partial \alpha_{k}^{\mathbf W}(\mathbf {x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\alpha_{k}^{\mathbf W}(\mathbf{x})-\beta_{k}\right)\right) \psi^{\prime}\left(\alpha_{k}^{\mathbf{W}}(\boldsymbol{x})\right)<br />
-\left(1-\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})-\beta_{0}\right)\right) \psi^{\prime}\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})\right)$$<br />
<br />
$$\frac{\partial l_{E B}^{\lambda}\left(\mathbf{y}, \alpha^{\mathbf{W}}(\mathbf{x})\right)}{\partial \alpha_{k}^{W}(\mathbf{x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})-\lambda\right)\right) \frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}<br />
-\left(1-\left(\tilde{\alpha}_{0}^{W}(\mathbf{x})-\lambda K\right)\right)$$<br />
<br />
As we can see, the first expression is affected by the magnitude of <math>\alpha^{\boldsymbol{W}}(\boldsymbol{x})</math>, whereas the second expression is not due to the <math>\frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}</math> ratio.<br />
<br />
== Experiments ==<br />
<br />
Throughout the experiments in this paper, the authors employ various models based on residual connections (He et al., 2016), which are standard benchmarking models in practice. We first demonstrate the improvements provided by BM, then show its versatility in other applications. For fairness of comparison, all configurations in the reference implementation are fixed; the only additions in the experiments are initial learning-rate warm-up and gradient clipping, which are extremely helpful for stable training of BM. <br />
<br />
=== Generalization performance === <br />
The paper compares the generalization performance of BM with softmax and MC dropout on CIFAR-10 and CIFAR-100 benchmarks.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T1.png]]<br />
<br />
The next comparison was performed between BM and softmax on the ImageNet benchmark. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T2.png]]<br />
<br />
For both datasets and in all configurations, BM achieves the best generalization, outperforming softmax and MC dropout.<br />
<br />
===== Regularization effect of prior =====<br />
<br />
In theory, BM has two regularization effects:<br />
::1. The prior distribution, which smooths the target posterior.<br />
::2. Averaging over all possible categorical probabilities when computing the distribution-matching loss.<br />
The authors perform an ablation study to examine the two effects separately - removing the KL term in the ELBO removes the effect of the prior distribution.<br />
For ResNet-50 on CIFAR-100 and CIFAR-10 the resulting test error rates were 24.69% and 5.68% respectively. <br />
<br />
This demonstrates that both regularization effects are significant since just having one of them improves the generalization performance compared to the softmax baseline, and having both improves the performance even more.<br />
<br />
===== Impact of <math>\beta</math> =====<br />
<br />
The effect of β on generalization performance is studied by training ResNet-18 on CIFAR-10 by tuning the value of β on its own, as well as jointly with λ. It was found that robust generalization performance is obtained for β ∈ [<math>e^{−1}, e^4</math>] when tuning β on its own; and β ∈ [<math>e^{−4}, e^{8}</math>] when tuning β jointly with λ. The figure below shows a plot of the error rate with varying β.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F3.png]]<br />
<br />
=== Uncertainty Representation ===<br />
<br />
One of the big advantages of BM is the ability to represent uncertainty about the prediction. The authors evaluate the uncertainty representation on in-distribution (ID) and out-of-distribution (OOD) samples. <br />
<br />
===== ID uncertainty =====<br />
<br />
For ID (in-distribution) samples, calibration performance is measured, which is a measure of how well the model’s confidence matches its actual accuracy. This measure can be visualized using reliability plots and quantified using a metric called expected calibration error (ECE). ECE is calculated by grouping predictions into <math>M</math> bins based on their confidence scores and then taking a weighted average of the absolute difference between the average accuracy and average confidence of each bin.<br />
The figure below is a reliability plot of ResNet-50 on CIFAR-10 and CIFAR-100 with 15 groups. It shows that BM has a significantly better calibration performance than softmax since the confidence matches the accuracy more closely (this is also reflected in the lower ECE).<br />
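The ECE computation described above can be sketched as follows (the bin count and toy predictions are illustrative assumptions):<br />

```python
import numpy as np

def ece(confidences, correct, n_bins=15):
    """Expected calibration error: bin predictions by confidence, then take the
    sample-weighted average of |accuracy - confidence| over non-empty bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            err += (mask.sum() / n) * abs(correct[mask].mean() - confidences[mask].mean())
    return err

conf = np.array([0.9, 0.8, 0.95, 0.6])   # max softmax/BM-mean probabilities
hits = np.array([1.0, 0.0, 1.0, 1.0])    # 1 if the prediction was correct
print(ece(conf, hits, n_bins=15))        # 0 would mean perfect calibration
```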
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F4.png]]<br />
<br />
===== OOD uncertainty =====<br />
<br />
Here, the authors quantify uncertainty using predictive entropy - the larger the predictive entropy, the larger the uncertainty about a prediction. <br />
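Predictive entropy is simple to compute from the predictive distribution. A sketch (the `eps` guard against <math>\log 0</math> and the toy distributions are assumptions):<br />

```python
import numpy as np

def predictive_entropy(p, eps=1e-12):
    """H(p) = -sum_k p_k log p_k; larger values indicate more uncertainty."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum())

print(predictive_entropy([0.98, 0.01, 0.01]))  # low entropy: confident, ID-like
print(predictive_entropy([1/3, 1/3, 1/3]))     # near log(3): uncertain, OOD-like
```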
<br />
The figure below is a density plot of the predictive entropy of ResNet-50 on CIFAR-10. It shows that BM provides significantly better uncertainty estimation compared to other methods since BM is the only method that has a clear peak of high predictive entropy for OOD samples which should have high uncertainty. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F5.png]]<br />
<br />
=== Transfer learning ===<br />
<br />
Belief matching applies the Bayesian principle outside the neural network, which means it can easily be applied to already trained models. Thus, belief matching can be employed in transfer learning scenarios. The authors downloaded the ImageNet pre-trained ResNet-50 weights and fine-tuned the weights of the last linear layer for 100 epochs using an Adam optimizer.<br />
<br />
This table shows the test error rates from transfer learning on CIFAR-10, Food-101, and Cars datasets. Belief matching consistently performs better than softmax. <br />
<br />
[[File:being_bayesian_about_categorical_probability_transfer_learning.png]]<br />
<br />
Belief matching was also tested on predictive uncertainty for out-of-distribution samples, with CIFAR-10 as the in-distribution dataset. Looking at the figure below, belief matching significantly improves the uncertainty representation of pre-trained models by fine-tuning only the last layer’s weights. Note that belief matching confidently predicts examples in Cars, since CIFAR-10 contains the object category automobiles. In comparison, softmax produces confident predictions on all datasets. Thus, belief matching can enhance the uncertainty representation of pre-trained models without sacrificing their generalization performance.<br />
<br />
[[File: being_bayesian_about_categorical_probability_transfer_learning_uncertainty.png]]<br />
<br />
=== Semi-Supervised Learning ===<br />
<br />
Belief matching’s ability to let neural networks represent rich information in their predictions can be exploited to aid consistency-based loss functions for semi-supervised learning. Consistency-based loss functions use unlabelled samples to promote robustness of predictions under stochastic perturbations. This can be done by perturbing the inputs (the VAT model) or the networks (the pi-model). Both methods minimize the divergence between two categorical probabilities under some perturbation, so belief matching can be used via the following replacements in the loss functions. The hope is that belief matching provides better prediction consistency using its Dirichlet distributions.<br />
<br />
[[File: being_bayesian_about_categorical_probability_semi_supervised_equation.png]]<br />
<br />
The results of training on ResNet28-2 with consistency based loss functions on CIFAR-10 are shown in this table. Belief matching does have lower classification error rates compared to using a softmax.<br />
<br />
[[File:being_bayesian_about_categorical_probability_semi_supervised_table.png]]<br />
<br />
== Conclusion ==<br />
<br />
Bayesian principles can be used to construct the target distribution by using the categorical probability as a random variable rather than a training label. This can be applied to neural network models by replacing only the softmax and cross-entropy loss, while improving the generalization performance and uncertainty estimation. <br />
<br />
In the future, the authors would like to allow for more expressive distributions in the belief matching framework, such as logistic normal distributions to capture strong semantic similarities among class labels. Furthermore, using input dependent priors would allow for interesting properties that would aid imbalanced datasets and multi-domain learning.<br />
<br />
== Citations ==<br />
<br />
[1] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Springer, 1990.<br />
<br />
[2] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.<br />
<br />
[3] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.<br />
<br />
[4] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. <br />
<br />
[5] MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448– 472, 1992.<br />
<br />
[6] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011. <br />
<br />
[7] Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1):4873–4907, 2017.<br />
<br />
[8] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference of Machine Learning, 2018.<br />
<br />
[9] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[10] Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[11] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.<br />
<br />
[12] Neumann, L., Zisserman, A., and Vedaldi, A. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2018.<br />
<br />
[13] Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. Disturblabel: Regularizing cnn on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.<br />
<br />
[14] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_Probability&diff=49053Being Bayesian about Categorical Probability2020-12-04T00:58:56Z<p>Z42qin: /* Classification With a Neural Network */</p>
<hr />
<div>== Presented By ==<br />
Evan Li, Jason Pu, Karam Abuaisha, Nicholas Vadivelu<br />
<br />
== Introduction ==<br />
<br />
Since the outputs of neural networks are not probabilities, Softmax (Bridle, 1990) is a staple for neural network’s performing classification--it exponentiates each logit then normalizes by the sum, giving a distribution over the target classes. However, networks with softmax outputs give no information about uncertainty (Blundell et al., 2015; Gal & Ghahramani, 2016), and the resulting distribution over classes is poorly calibrated (Guo et al., 2017), often giving overconfident predictions even when the classification is wrong. In addition, softmax also raises concerns about overfitting NNs due to its confident predictive behaviors(Xie et al., 2016; Pereyra et al., 2017). To achieve better generalization performance, this may require some effective regularization techniques. <br />
<br />
Bayesian Neural Networks (BNNs; MacKay, 1992) can alleviate these issues, but the resulting posteriors over the parameters are often intractable. Approximations such as variational inference (Graves, 2011; Blundell et al., 2015) and Monte Carlo Dropout (Gal & Ghahramani, 2016) can still be expensive or give poor estimates for the posteriors. This work proposes a Bayesian treatment of the output logits of the neural network, treating the targets as a categorical random variable instead of a fixed label. This gives a computationally cheap way to get well-calibrated uncertainty estimates on neural network classifications.<br />
<br />
== Related Work ==<br />
<br />
Using Bayesian Neural Networks is the dominant way of applying Bayesian techniques to neural networks. Many techniques have been developed to make posterior approximation more accurate and scalable, despite these, BNNs do not scale to the state of the art techniques or large data sets. There are techniques to explicitly avoid modeling the full weight posterior that are more scalable, such as with Monte Carlo Dropout (Gal & Ghahramani, 2016) or tracking mean/covariance of the posterior during training (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019). Non-Bayesian uncertainty estimation techniques such as deep ensembles (Lakshminarayanan et al., 2017) and temperature scaling (Guo et al., 2017; Neumann et al., 2018).<br />
<br />
== Preliminaries ==<br />
=== Definitions ===<br />
Let's formalize our classification problem and define some notations for the rest of this summary:<br />
<br />
::Dataset:<br />
$$ \mathcal D = \{(x_i,y_i)\} \in (\mathcal X \times \mathcal Y)^N $$<br />
::General classification model<br />
$$ f^W: \mathcal X \to \mathbb R^K $$<br />
::Softmax function: <br />
$$ \phi(x): \mathbb R^K \to [0,1]^K \;\;|\;\; \phi_k(X) = \frac{\exp(f_k^W(x))}{\sum_{k \in K} \exp(f_k^W(x))} $$<br />
::Softmax activated NN:<br />
$$ \phi \;\circ\; f^W: \chi \to \Delta^{K-1} $$<br />
::NN as a true classifier:<br />
$$ arg\max_i \;\circ\; \phi_i \;\circ\; f^W \;:\; \mathcal X \to \mathcal Y $$<br />
<br />
We'll also define the '''count function''' - a <math>K</math>-vector valued function that outputs the occurences of each class coincident with <math>x</math>:<br />
$$ c^{\mathcal D}(x) = \sum_{(x',y') \in \mathcal D} \mathbb y' I(x' = x) $$<br />
<br />
=== Classification With a Neural Network ===<br />
A typical loss function used in classification is cross-entropy. It's well known that optimizing <math>f^W</math> for <math>l_{CE}</math> is equivalent to optimizing for <math>l_{KL}</math>, the <math>KL</math> divergence between the true distribution and the distribution modeled by NN, that is:<br />
$$ l_{KL}(W) = KL(\text{true distribution} \;|\; \text{distribution encoded by }NN(W)) $$<br />
Let's introduce notations for the underlying (true) distributions of our problem. Let <math>(x_0,y_0) \sim (\mathcal X \times \mathcal Y)</math>:<br />
$$ \text{Full Distribution} = F(x,y) = P(x_0 = x,y_0 = y) $$<br />
$$ \text{Marginal Distribution} = P(x) = F(x_0 = x) $$<br />
$$ \text{Point Class Distribution} = P(y_0 = y \;|\; x_0 = x) = F_x(y) $$<br />
Then we have the following factorization:<br />
$$ F(x,y) = P(x,y) = P(y|x)P(x) = F_x(y)F(x) $$<br />
Substitute this into the definition of KL divergence:<br />
$$ = \sum_{(x,y) \in \mathcal X \times \mathcal Y} F(x,y) \log\left(\frac{F(x,y)}{\phi_y(f^W(x))}\right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F(y|x) \log\left( \frac{F(y|x)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F_x(y) \log\left( \frac{F_x(y)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) KL(F_x \;||\; \phi\left( f^W(x) \right)) $$<br />
As usual, we don't have an analytic form for <math>l</math> (if we did, this would imply we know <math>F_X</math> meaning we knew the distribution in the first place). Instead, estimate from <math>\mathcal D</math>:<br />
$$ F(x) \approx \hat F(x) = \frac{||c^{\mathcal D}(x)||_1}{N} $$<br />
$$ F_x(y) \approx \hat F_x(y) = \frac{c^{\mathcal D}(x)}{|| c^{\mathcal D}(x) ||_1}$$<br />
$$ \to l_{KL}(W) = \sum_{x \in \mathcal D} \frac{||c^{\mathcal D}(x)||_1}{N} KL \left( \frac{c^{\mathcal D}(x)}{||c^{\mathcal D}(x)||_1} \;||\; \phi(f^W(x)) \right) $$<br />
The approximations <math>\hat F, \hat F_X</math> are often not very good though: consider a typical classification such as MNIST, we would never expect two handwritten digits to produce the exact same image. Hence <math>c^{\mathcal D}(x)</math> is (almost) always going to have a single index 1 and the rest 0. This has implications for our approximations:<br />
$$ \hat F(x) \text{ is uniform for all } x \in \mathcal D $$<br />
$$ \hat F_x(y) \text{ is degenerate for all } x \in \mathcal D $$<br />
This clearly has implications for overfitting: to minimize the KL term in <math>l_{KL}(W)</math> we want <math>\phi(f^W(x))</math> to be very close to <math>\hat F_x(y)</math> at each point - this means that the loss function is in fact encouraging the neural network to output near degenerate distributions! <br />
<br />
'''Labal Smoothing'''<br />
One form of regularization to help this problem is called label smoothing. Instead of using the degenerate $$F_x(y)$$ as a target function, let's "smooth" it (by adding a scaled uniform distribution to it) so it's no longer degenerate:<br />
$$ F'_x(y) = (1-\lambda)\hat F_x(y) + \frac \lambda K \vec 1 $$<br />
<br />
'''BNNs'''<br />
BBNs balances the complexity of the model and the distance to target distribution without choosing a single beset configuration (one-hot encoding). Specifically, BNNs with the Gaussian Weight prior $$F_x(y) ~ N (0,T^-1 I)$$ has score of configuration W measured by the posterior density $$p_W(W|D) $$<br />
<br />
== Method ==<br />
The main technical proposal of the paper is a Bayesian framework to estimate the (former) target distribution <math>F_x(y)</math>. That is, we construct a posterior distribution of <math> F_x(y) </math> and use that as our new target distribution. We call it the ''belief matching'' (BM) framework.<br />
<br />
=== Constructing Target Distribution ===<br />
Recall that <math>F_x(y)</math> is a k-categorical probability distribution - it's PMF can be fully characterized by k numbers that sum to 1. Hence we can encode any such <math>F_x</math> as a point in <math>\Delta^{k-1}</math>. We'll do exactly that - let's call this vecor <math>z</math>:<br />
$$ z \in \Delta^{k-1} $$<br />
$$ \text{prior} = p_{z|x}(z) $$<br />
$$ \text{conditional} = p_{y|z,x}(y) $$<br />
$$ \text{posterior} = p_{z|x,y}(z) $$<br />
Then if we perform inference:<br />
$$ p_{z|x,y}(z) \propto p_{z|x}(z)p_{y|z,x}(y) $$<br />
The distribution chosen to model prior was <math>dir_K(\beta)</math>:<br />
$$ p_{z|x}(z) = \frac{\Gamma(||\beta||_1)}{\prod_{k=1}^K \Gamma(\beta_k)} \prod_{k=1}^K z_k^{\beta_k - 1} $$<br />
Note that by definition of <math>z</math>: <math> p_{y|x,z} = z_y </math>. Since the Dirichlet is a conjugate prior to categorical distributions we have a convenient form for the mean of the posterior:<br />
$$ \bar{p_{z|x,y}}(z) = \frac{\beta + c^{\mathcal D}(x)}{||\beta + c^{\mathcal D}(x)||_1} \propto \beta + c^{\mathcal D}(x) $$<br />
This is in fact a generalization of (uniform) label smoothing (label smoothing is a special case where <math>\beta = \frac 1 K \vec{1} </math>).<br />
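To see this generalization concretely, here is a small NumPy check (the symmetric prior strength <math>b</math> and the mapping <math>\lambda = Kb/(Kb+1)</math> are illustrative choices for a single one-hot count):<br />

```python
import numpy as np

K, b = 3, 0.5                     # three classes, symmetric prior Dir(b * 1)
beta = np.full(K, b)
c = np.array([0.0, 1.0, 0.0])     # a single observation of class 1 (one-hot count)

posterior_mean = (beta + c) / (beta + c).sum()

# The same target via uniform label smoothing with lambda = K*b / (K*b + 1)
lam = K * b / (K * b + 1.0)
smoothed = (1.0 - lam) * c + lam / K
assert np.allclose(posterior_mean, smoothed)   # [0.2, 0.6, 0.2] in both cases
```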
<br />
=== Representing Approximate Distribution ===<br />
Our new target distribution is <math>p_{z|x,y}(z)</math> (as opposed to <math>F_x(y)</math>). That is, we want to interpret our neural network weights as encoding a distribution with support on <math> \Delta^{K-1} </math> - the NN can then be trained so this encoded distribution closely approximates <math>p_{z|x,y}</math>. Let's denote the density of this encoded distribution <math>q_{z|x}^W</math>. This is how the BM framework defines it:<br />
$$ \alpha^W(x) := \exp(f^W(x)) $$<br />
$$ q_{z|x}^W(z) = \frac{\Gamma(||\alpha^W(x)||_1)}{\prod_{k=1}^K \Gamma(\alpha_k^W(x))} \prod_{k=1}^K z_{k}^{\alpha_k^W(x) - 1} $$<br />
$$ \to Z^W_x \sim dir(\alpha^W(x)) $$<br />
Taking the logarithm of <math>q_{z|x}^W</math> and re-exponentiating, we can rewrite it as:<br />
$$ q^W_{z|x}(z) \propto \exp \left( \sum_k (\alpha_k^W(x) \log(z_k)) - \sum_k \log(z_k) \right) $$<br />
$$ \propto -l_{CE}(\phi(f^W(x)),z) + \frac{K}{||\alpha^W(x)||}KL(\mathcal U_k \;||\; z) $$<br />
It can actually be shown that the mean of <math>Z_x^W</math> is identical to <math>\phi(f^W(x))</math> - in other words, if we output the mean of the encoded distribution of our neural network under the BM framework, it is theoretically identical to a traditional neural network.<br />
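This identity is easy to verify numerically - a minimal sketch with random logits:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=5)              # an illustrative f^W(x) for K = 5

alpha = np.exp(logits)                   # alpha^W(x) = exp(f^W(x))
dirichlet_mean = alpha / alpha.sum()     # mean of Dir(alpha) is alpha / ||alpha||_1

softmax_out = np.exp(logits) / np.exp(logits).sum()
assert np.allclose(dirichlet_mean, softmax_out)
```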
<br />
=== Distribution Matching ===<br />
<br />
We now need a way to fit our approximate distribution from our neural network <math>q_{\mathbf{z | x}}^{\mathbf{W}}</math> to our target distribution <math>p_{\mathbf{z|x},y}</math>. The authors achieve this by maximizing the evidence lower bound (ELBO):<br />
<br />
$$l_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) $$<br />
<br />
Each term can be computed analytically:<br />
<br />
$$\mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf W }} \left[\log z_y \right] = \psi(\alpha_y^{\mathbf W} ( \mathbf x )) - \psi(\alpha_0^{\mathbf W} ( \mathbf x )) $$<br />
<br />
Where <math>\psi(\cdot)</math> represents the digamma function (logarithmic derivative of gamma function). Intuitively, we maximize the probability of the correct label. For the KL term:<br />
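The digamma identity for the expected log-likelihood can be sanity-checked by Monte Carlo (a sketch with illustrative values of <math>\alpha</math>; SciPy provides the digamma function):<br />

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.0, 3.0])   # illustrative alpha^W(x) for K = 3 classes
y = 0                               # index of the observed class

# Analytic form: psi(alpha_y) - psi(alpha_0), where alpha_0 = sum_k alpha_k
analytic = digamma(alpha[y]) - digamma(alpha.sum())

# Monte Carlo estimate of E_q[log z_y] under z ~ Dir(alpha)
samples = rng.dirichlet(alpha, size=200_000)
monte_carlo = np.log(samples[:, y]).mean()
# the two estimates agree closely (to roughly 1e-3 here)
```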
<br />
$$KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) = \log \frac{\Gamma(\alpha_0^{\mathbf W}(\mathbf x)) \prod_k \Gamma(\beta_k)}{\prod_k \Gamma(\alpha_k^{\mathbf W}(\mathbf x)) \Gamma (\beta_0)} + \sum_k (\alpha_k^{\mathbf W}(\mathbf x)-\beta_k)(\psi(\alpha_k^{\mathbf W}(\mathbf x)) - \psi(\alpha_0^{\mathbf W}(\mathbf x))) $$<br />
<br />
In the first term, for intuition, we can ignore <math>\alpha_0</math> and <math>\beta_0</math> since those just calibrate the distributions. Otherwise, we want the ratio of the products to be as close to 1 as possible to minimize the KL. In the second term, we want to minimize the difference between each individual <math>\alpha_k</math> and <math>\beta_k</math>, scaled by the normalized output of the neural network. <br />
<br />
This loss function can be used as a drop-in replacement for the standard softmax cross-entropy, as it has an analytic form and the same time complexity as typical softmax-cross entropy with respect to the number of classes (<math>O(K)</math>).<br />
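A minimal per-example sketch of this drop-in loss in NumPy/SciPy (the function names are illustrative and the <math>\lambda</math> argument anticipates the weighted variant discussed below - this is not the authors' reference implementation):<br />

```python
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_kl(alpha, beta):
    """Closed-form KL( Dir(alpha) || Dir(beta) )."""
    a0, b0 = alpha.sum(), beta.sum()
    return (gammaln(a0) - gammaln(alpha).sum()
            - gammaln(b0) + gammaln(beta).sum()
            + ((alpha - beta) * (digamma(alpha) - digamma(a0))).sum())

def bm_loss(logits, y, beta, lam=1.0):
    """Per-example negative ELBO: -(E_q[log z_y] - lam * KL)."""
    alpha = np.exp(logits)
    expected_loglik = digamma(alpha[y]) - digamma(alpha.sum())
    return -(expected_loglik - lam * dirichlet_kl(alpha, beta))

# Sanity checks: KL vanishes between identical Dirichlets and is otherwise positive
a = np.array([2.0, 1.0, 3.0])
assert np.isclose(dirichlet_kl(a, a), 0.0)
assert dirichlet_kl(a, np.ones(3)) > 0.0
```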
<br />
=== On Prior Distributions ===<br />
<br />
We must choose the concentration parameter <math>\beta</math> of the Dirichlet prior. The prior essentially disappears as <math>\beta_0 \to 0</math> and becomes stronger as <math>\beta_0 \to \infty</math>. Thus, we want a small <math>\beta_0</math> so the posterior isn't dominated by the prior. However, the authors note that a small <math>\beta_0</math> encourages <math>\alpha_0^{\mathbf W}(\mathbf x)</math> to be small, which makes <math>\psi (\alpha_0^{\mathbf W}(\mathbf x))</math> large in magnitude - problematic for gradient-based optimization. In practice, many neural network techniques aim to make <math>\mathbb E [f^{\mathbf W} (\mathbf x)] \approx \mathbf 0</math> and thus <math>\mathbb E [\alpha^{\mathbf W} (\mathbf x)] \approx \mathbf 1</math>, so forcing <math>\alpha_0^{\mathbf W}(\mathbf x)</math> to be small can be counterproductive.<br />
<br />
So, the authors set <math>\beta = \mathbf 1</math> and introduce a new hyperparameter <math>\lambda</math> which is multiplied with the KL term in the ELBO:<br />
<br />
$$l^\lambda_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - \lambda KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; \mathcal P^D (\mathbf 1)) $$<br />
<br />
This stabilizes the optimization, as we can tell from the gradients:<br />
<br />
$$\frac{\partial l_{E B}\left(\mathbf{y}, \alpha^{\mathbf W}(\mathbf{x})\right)}{\partial \alpha_{k}^{\mathbf W}(\mathbf {x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\alpha_{k}^{\mathbf W}(\mathbf{x})-\beta_{k}\right)\right) \psi^{\prime}\left(\alpha_{k}^{\mathbf{W}}(\boldsymbol{x})\right)<br />
-\left(1-\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})-\beta_{0}\right)\right) \psi^{\prime}\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})\right)$$<br />
<br />
$$\frac{\partial l_{E B}^{\lambda}\left(\mathbf{y}, \alpha^{\mathbf{W}}(\mathbf{x})\right)}{\partial \alpha_{k}^{W}(\mathbf{x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})-\lambda\right)\right) \frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}<br />
-\left(1-\left(\tilde{\alpha}_{0}^{W}(\mathbf{x})-\lambda K\right)\right)$$<br />
<br />
As we can see, the first expression is affected by the magnitude of <math>\alpha^{\boldsymbol{W}}(\boldsymbol{x})</math>, whereas the second expression is not due to the <math>\frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}</math> ratio.<br />
<br />
== Experiments ==<br />
<br />
Throughout the experiments in this paper, the authors employ various models based on residual connections (He et al., 2016 [1]), which are standard models for benchmarking in practice. They first demonstrate the improvements provided by BM, then show its versatility in other applications. For fairness of comparison, all configurations in the reference implementation are fixed; the only additions are an initial learning-rate warm-up and gradient clipping, which are extremely helpful for stable training of BM. <br />
<br />
=== Generalization performance === <br />
The paper compares the generalization performance of BM with softmax and MC dropout on CIFAR-10 and CIFAR-100 benchmarks.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T1.png]]<br />
<br />
The next comparison was performed between BM and softmax on the ImageNet benchmark. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T2.png]]<br />
<br />
For both datasets and in all configurations, BM achieves the best generalization, outperforming softmax and MC dropout.<br />
<br />
===== Regularization effect of prior =====<br />
<br />
In theory, BM has two regularization effects:<br />
::1. The prior distribution, which smooths the target posterior.<br />
::2. Averaging all of the possible categorical probabilities to compute the distribution matching loss.<br />
The authors perform an ablation study to examine the two effects separately - removing the KL term in the ELBO removes the effect of the prior distribution.<br />
For ResNet-50 on CIFAR-100 and CIFAR-10 the resulting test error rates were 24.69% and 5.68% respectively. <br />
<br />
This demonstrates that both regularization effects are significant since just having one of them improves the generalization performance compared to the softmax baseline, and having both improves the performance even more.<br />
<br />
===== Impact of <math>\beta</math> =====<br />
<br />
The effect of β on generalization performance is studied by training ResNet-18 on CIFAR-10 by tuning the value of β on its own, as well as jointly with λ. It was found that robust generalization performance is obtained for β ∈ [<math>e^{−1}, e^4</math>] when tuning β on its own; and β ∈ [<math>e^{−4}, e^{8}</math>] when tuning β jointly with λ. The figure below shows a plot of the error rate with varying β.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F3.png]]<br />
<br />
=== Uncertainty Representation ===<br />
<br />
One of the big advantages of BM is the ability to represent uncertainty about the prediction. The authors evaluate the uncertainty representation on in-distribution (ID) and out-of-distribution (OOD) samples. <br />
<br />
===== ID uncertainty =====<br />
<br />
For ID (in-distribution) samples, calibration performance is measured, which is a measure of how well the model’s confidence matches its actual accuracy. This measure can be visualized using reliability plots and quantified using a metric called expected calibration error (ECE). ECE is calculated by grouping predictions into M groups based on their confidence score and then finding the absolute difference between the average accuracy and average confidence for each group.<br />
The figure below is a reliability plot of ResNet-50 on CIFAR-10 and CIFAR-100 with 15 groups. It shows that BM has a significantly better calibration performance than softmax since the confidence matches the accuracy more closely (this is also reflected in the lower ECE).<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F4.png]]<br />
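The binned ECE computation described above can be sketched as follows (the bin edges and toy data are illustrative):<br />

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Bin predictions by confidence, then sum the |accuracy - confidence|
    gaps weighted by the fraction of samples in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# A well-calibrated predictor: 90% accuracy at 90% confidence -> ECE ~ 0
conf = np.full(10, 0.9)
hits = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
ece = expected_calibration_error(conf, hits)

# A miscalibrated one: 50% accuracy at 99% confidence -> large ECE
overconf = np.full(10, 0.99)
coin = np.array([1, 0] * 5)
assert expected_calibration_error(overconf, coin) > 0.4
```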
<br />
===== OOD uncertainty =====<br />
<br />
Here, the authors quantify uncertainty using predictive entropy - the larger the predictive entropy, the larger the uncertainty about a prediction. <br />
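A minimal sketch of this uncertainty measure (the small <math>\epsilon</math> guard is a numerical convenience):<br />

```python
import numpy as np

def predictive_entropy(p, eps=1e-12):
    """Entropy -sum p log p of a categorical predictive distribution."""
    return -np.sum(p * np.log(p + eps))

uniform = np.full(10, 0.1)    # maximally uncertain over K = 10 classes
onehot = np.eye(10)[0]        # maximally confident

predictive_entropy(uniform)   # ~= log(10), the maximum
predictive_entropy(onehot)    # ~= 0, the minimum
```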
<br />
The figure below is a density plot of the predictive entropy of ResNet-50 on CIFAR-10. It shows that BM provides significantly better uncertainty estimation compared to other methods since BM is the only method that has a clear peak of high predictive entropy for OOD samples which should have high uncertainty. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F5.png]]<br />
<br />
=== Transfer learning ===<br />
<br />
Belief matching applies the Bayesian principle outside the neural network, which means it can easily be applied to already trained models. Thus, belief matching can be employed in transfer learning scenarios. The authors downloaded the ImageNet pre-trained ResNet-50 weights and fine-tuned the weights of the last linear layer for 100 epochs using an Adam optimizer.<br />
<br />
This table shows the test error rates from transfer learning on CIFAR-10, Food-101, and Cars datasets. Belief matching consistently performs better than softmax. <br />
<br />
[[File:being_bayesian_about_categorical_probability_transfer_learning.png]]<br />
<br />
Belief matching was also tested for the predictive uncertainty for out of dataset samples based on CIFAR-10 as the in distribution sample. Looking at the figure below, it is observed that belief matching significantly improves the uncertainty representation of pre-trained models by only fine-tuning the last layer’s weights. Note that belief matching confidently predicts examples in Cars since CIFAR-10 contains the object category automobiles. In comparison, softmax produces confident predictions on all datasets. Thus, belief matching could also be used to enhance the uncertainty representation ability of pre-trained models without sacrificing their generalization performance.<br />
<br />
[[File: being_bayesian_about_categorical_probability_transfer_learning_uncertainty.png]]<br />
<br />
=== Semi-Supervised Learning ===<br />
<br />
Belief matching’s ability to let neural networks represent rich information in their predictions can be exploited to aid consistency-based loss functions for semi-supervised learning. Consistency-based loss functions use unlabelled samples to promote robustness of predictions under stochastic perturbations. This can be done by perturbing the inputs (as in the VAT model) or the network (as in the Π-model). Both methods minimize the divergence between two categorical probabilities under some perturbation, so belief matching can be used via the following replacements in the loss functions. The hope is that belief matching provides better prediction consistencies using its Dirichlet distributions.<br />
<br />
[[File: being_bayesian_about_categorical_probability_semi_supervised_equation.png]]<br />
<br />
The results of training on ResNet28-2 with consistency based loss functions on CIFAR-10 are shown in this table. Belief matching does have lower classification error rates compared to using a softmax.<br />
<br />
[[File:being_bayesian_about_categorical_probability_semi_supervised_table.png]]<br />
<br />
== Conclusion ==<br />
<br />
Bayesian principles can be used to construct the target distribution by using the categorical probability as a random variable rather than a training label. This can be applied to neural network models by replacing only the softmax and cross-entropy loss, while improving the generalization performance and uncertainty estimation. <br />
<br />
In the future, the authors would like to allow for more expressive distributions in the belief matching framework, such as logistic normal distributions to capture strong semantic similarities among class labels. Furthermore, using input dependent priors would allow for interesting properties that would aid imbalanced datasets and multi-domain learning.<br />
<br />
== Citations ==<br />
<br />
[1] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Springer, 1990.<br />
<br />
[2] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.<br />
<br />
[3] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.<br />
<br />
[4] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. <br />
<br />
[5] MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448– 472, 1992.<br />
<br />
[6] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011. <br />
<br />
[7] Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1):4873–4907, 2017.<br />
<br />
[8] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference of Machine Learning, 2018.<br />
<br />
[9] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[10] Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[11] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.<br />
<br />
[12] Neumann, L., Zisserman, A., and Vedaldi, A. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2018.<br />
<br />
[13] Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. Disturblabel: Regularizing cnn on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.<br />
<br />
[14] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.</div>
<hr />
<div>== Presented By ==<br />
Evan Li, Jason Pu, Karam Abuaisha, Nicholas Vadivelu<br />
<br />
== Introduction ==<br />
<br />
Since the outputs of neural networks are not probabilities, softmax (Bridle, 1990) is a staple for neural networks performing classification - it exponentiates each logit, then normalizes by the sum, giving a distribution over the target classes. However, networks with softmax outputs give no information about uncertainty (Blundell et al., 2015; Gal & Ghahramani, 2016), and the resulting distribution over classes is poorly calibrated (Guo et al., 2017), often giving overconfident predictions even when the classification is wrong. In addition, softmax raises concerns about overfitting due to its confident predictive behavior (Xie et al., 2016; Pereyra et al., 2017), so achieving good generalization performance may require effective regularization techniques.<br />
<br />
Bayesian Neural Networks (BNNs; MacKay, 1992) can alleviate these issues, but the resulting posteriors over the parameters are often intractable. Approximations such as variational inference (Graves, 2011; Blundell et al., 2015) and Monte Carlo Dropout (Gal & Ghahramani, 2016) can still be expensive or give poor estimates for the posteriors. This work proposes a Bayesian treatment of the output logits of the neural network, treating the targets as a categorical random variable instead of a fixed label. This gives a computationally cheap way to get well-calibrated uncertainty estimates on neural network classifications.<br />
<br />
== Related Work ==<br />
<br />
Using Bayesian Neural Networks is the dominant way of applying Bayesian techniques to neural networks. Many techniques have been developed to make posterior approximation more accurate and scalable; despite these, BNNs still do not scale to state-of-the-art architectures or large datasets. More scalable techniques explicitly avoid modeling the full weight posterior, such as Monte Carlo Dropout (Gal & Ghahramani, 2016) or tracking the mean and covariance of the posterior during training (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019). There are also non-Bayesian uncertainty estimation techniques, such as deep ensembles (Lakshminarayanan et al., 2017) and temperature scaling (Guo et al., 2017; Neumann et al., 2018).<br />
<br />
== Preliminaries ==<br />
=== Definitions ===<br />
Let's formalize our classification problem and define some notations for the rest of this summary:<br />
<br />
::Dataset:<br />
$$ \mathcal D = \{(x_i,y_i)\} \in (\mathcal X \times \mathcal Y)^N $$<br />
::General classification model<br />
$$ f^W: \mathcal X \to \mathbb R^K $$<br />
::Softmax function: <br />
$$ \phi: \mathbb R^K \to [0,1]^K \;\;|\;\; \phi_k(f^W(x)) = \frac{\exp(f_k^W(x))}{\sum_{j=1}^K \exp(f_j^W(x))} $$<br />
::Softmax activated NN:<br />
$$ \phi \;\circ\; f^W: \chi \to \Delta^{K-1} $$<br />
::NN as a true classifier:<br />
$$ arg\max_i \;\circ\; \phi_i \;\circ\; f^W \;:\; \mathcal X \to \mathcal Y $$<br />
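A minimal numerical sketch of the softmax map and the induced arg-max classifier (the logits are illustrative):<br />

```python
import numpy as np

def softmax(logits):
    """phi: R^K -> Delta^{K-1}; subtract the max for numerical stability."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # an illustrative f^W(x) for K = 3
probs = softmax(logits)
assert np.isclose(probs.sum(), 1.0)
predicted_class = int(np.argmax(probs))   # the arg-max classifier picks class 0 here
```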
<br />
We'll also define the '''count function''' - a <math>K</math>-vector valued function that outputs the occurrences of each class coincident with <math>x</math> (treating each label <math>y'</math> as a one-hot vector):<br />
$$ c^{\mathcal D}(x) = \sum_{(x',y') \in \mathcal D} y' \, I(x' = x) $$<br />
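A toy sketch of the count function (the dataset and helper name are illustrative):<br />

```python
import numpy as np

K = 3
# Toy dataset: inputs as hashable keys, labels as class indices
data = [("a", 0), ("a", 0), ("a", 1), ("b", 2)]

def count(x, dataset, K):
    """c^D(x): K-vector of per-class occurrence counts for input x
    (each label contributes its one-hot vector)."""
    c = np.zeros(K)
    for x_prime, y_prime in dataset:
        if x_prime == x:
            c[y_prime] += 1
    return c

count("a", data, K)   # array([2., 1., 0.])
count("b", data, K)   # array([0., 0., 1.])
```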
<br />
=== Classification With a Neural Network ===<br />
A typical loss function used in classification is cross-entropy. It's well known that optimizing <math>f^W</math> for <math>l_{CE}</math> is equivalent to optimizing for <math>l_{KL}</math>, the <math>KL</math> divergence between the true distribution and the distribution modeled by NN, that is:<br />
$$ l_{KL}(W) = KL(\text{true distribution} \;||\; \text{distribution encoded by }NN(W)) $$<br />
Let's introduce notations for the underlying (true) distributions of our problem. Let <math>(x_0,y_0) \sim (\mathcal X \times \mathcal Y)</math>:<br />
$$ \text{Full Distribution} = F(x,y) = P(x_0 = x,y_0 = y) $$<br />
$$ \text{Marginal Distribution} = F(x) = P(x_0 = x) $$<br />
$$ \text{Point Class Distribution} = P(y_0 = y \;|\; x_0 = x) = F_x(y) $$<br />
Then we have the following factorization:<br />
$$ F(x,y) = P(x,y) = P(y|x)P(x) = F_x(y)F(x) $$<br />
Substitute this into the definition of KL divergence:<br />
$$ l_{KL}(W) = \sum_{(x,y) \in \mathcal X \times \mathcal Y} F(x,y) \log\left(\frac{F(x,y)}{F(x)\,\phi_y(f^W(x))}\right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F(y|x) \log\left( \frac{F(y|x)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F_x(y) \log\left( \frac{F_x(y)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) KL(F_x \;||\; \phi\left( f^W(x) \right)) $$<br />
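This decomposition (the chain rule for KL divergence) can be verified numerically with random distributions:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, K = 4, 3
F_x = rng.dirichlet(np.ones(n_inputs))          # marginal F(x)
F_yx = rng.dirichlet(np.ones(K), n_inputs)      # conditionals F_x(y), one row per x
phi = rng.dirichlet(np.ones(K), n_inputs)       # model outputs phi(f^W(x))

F_joint = F_x[:, None] * F_yx                   # F(x, y) = F_x(y) F(x)

# KL between the true joint and the model's induced joint F(x) * phi
joint_kl = np.sum(F_joint * np.log(F_joint / (F_x[:, None] * phi)))
# Weighted sum of per-input KL divergences
decomposed = np.sum(F_x * np.sum(F_yx * np.log(F_yx / phi), axis=1))
assert np.isclose(joint_kl, decomposed)
```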
As usual, we don't have an analytic form for <math>l_{KL}</math> (if we did, this would imply we knew <math>F_x</math>, i.e., the true distribution, in the first place). Instead, we estimate it from <math>\mathcal D</math>:<br />
$$ F(x) \approx \hat F(x) = \frac{||c^{\mathcal D}(x)||_1}{N} $$<br />
$$ F_x(y) \approx \hat F_x(y) = \frac{c^{\mathcal D}(x)}{|| c^{\mathcal D}(x) ||_1}$$<br />
$$ \to l_{KL}(W) = \sum_{x \in \mathcal D} \frac{||c^{\mathcal D}(x)||_1}{N} KL \left( \frac{c^{\mathcal D}(x)}{||c^{\mathcal D}(x)||_1} \;||\; \phi(f^W(x)) \right) $$<br />
The approximations <math>\hat F, \hat F_x</math> are often not very good though: consider a typical classification task such as MNIST - we would never expect two handwritten digits to produce the exact same image. Hence <math>c^{\mathcal D}(x)</math> will (almost) always have a single entry equal to 1 and the rest 0. This has implications for our approximations:<br />
$$ \hat F(x) \text{ is uniform for all } x \in \mathcal D $$<br />
$$ \hat F_x(y) \text{ is degenerate for all } x \in \mathcal D $$<br />
This clearly has implications for overfitting: to minimize the KL term in <math>l_{KL}(W)</math> we want <math>\phi(f^W(x))</math> to be very close to <math>\hat F_x(y)</math> at each point - this means that the loss function is in fact encouraging the neural network to output near degenerate distributions! <br />
<br />
'''Labal Smoothing'''<br />
One form of regularization to help this problem is called label smoothing. Instead of using the degenerate $F_x(y)$ as a target function, let's "smooth" it (by adding a scaled uniform distribution to it) so it's no longer degenerate:<br />
$$ F'_x(y) = (1-\lambda)\hat F_x(y) + \frac \lambda K \vec 1 $$<br />
<br />
'''BNNs'''<br />
BBNs balances the complexity of the model and the distance to target distribution without choosing a single beset configuration (one-hot encoding). Specifically, BNNs with the Gaussian Weight prior $F_x(y) ~ N (0,T^-1 I)$ has score of configuration W measured by the posterior density $p_W(W|D) $<br />
<br />
== Method ==<br />
The main technical proposal of the paper is a Bayesian framework to estimate the (former) target distribution <math>F_x(y)</math>. That is, we construct a posterior distribution of <math> F_x(y) </math> and use that as our new target distribution. We call it the ''belief matching'' (BM) framework.<br />
<br />
=== Constructing Target Distribution ===<br />
Recall that <math>F_x(y)</math> is a k-categorical probability distribution - it's PMF can be fully characterized by k numbers that sum to 1. Hence we can encode any such <math>F_x</math> as a point in <math>\Delta^{k-1}</math>. We'll do exactly that - let's call this vecor <math>z</math>:<br />
$$ z \in \Delta^{k-1} $$<br />
$$ \text{prior} = p_{z|x}(z) $$<br />
$$ \text{conditional} = p_{y|z,x}(y) $$<br />
$$ \text{posterior} = p_{z|x,y}(z) $$<br />
Then if we perform inference:<br />
$$ p_{z|x,y}(z) \propto p_{z|x}(z)p_{y|z,x}(y) $$<br />
The distribution chosen to model prior was <math>dir_K(\beta)</math>:<br />
$$ p_{z|x}(z) = \frac{\Gamma(||\beta||_1)}{\prod_{k=1}^K \Gamma(\beta_k)} \prod_{k=1}^K z_k^{\beta_k - 1} $$<br />
Note that by definition of <math>z</math>: <math> p_{y|x,z} = z_y </math>. Since the Dirichlet is a conjugate prior to categorical distributions we have a convenient form for the mean of the posterior:<br />
$$ \bar{p_{z|x,y}}(z) = \frac{\beta + c^{\mathcal D}(x)}{||\beta + c^{\mathcal D}(x)||_1} \propto \beta + c^{\mathcal D}(x) $$<br />
This is in fact a generalization of (uniform) label smoothing (label smoothing is a special case where <math>\beta = \frac 1 K \vec{1} </math>).<br />
<br />
=== Representing Approximate Distribution ===<br />
Our new target distribution is <math>p_{z|x,y}(z)</math> (as opposed to <math>F_x(y)</math>). That is, we want to construct an interpretation of our neural network weights to construct a distribution with support in <math> \Delta^{K-1} </math> - the NN can then be trained so this encoded distribution closely approximates <math>p_{z|x,y}</math>. Let's denote the PMF of this encoded distribution <math>q_{z|x}^W</math>. This is how the BM framework defines it:<br />
$$ \alpha^W(x) := \exp(f^W(x)) $$<br />
$$ q_{z|x}^W(z) = \frac{\Gamma(||\alpha^W(x)||_1)}{\sum_{k=1}^K \Gamma(\alpha_k^W(x))} \prod_{k=1}^K z_{k}^{\alpha_k^W(x) - 1} $$<br />
$$ \to Z^W_x \sim dir(\alpha^W(x)) $$<br />
Apply <math>\log</math> then <math>\exp</math> to <math>q_{z|x}^W</math>:<br />
$$ q^W_{z|x}(z) \propto \exp \left( \sum_k (\alpha_k^W(x) \log(z_k)) - \sum_k \log(z_k) \right) $$<br />
$$ \propto -l_{CE}(\phi(f^W(x)),z) + \frac{K}{||\alpha^W(x)||}KL(\mathcal U_k \;||\; z) $$<br />
It can actually be shown that the mean of <math>Z_x^W</math> is identical to <math>\phi(f^W(x))</math> - in other words, if we output the mean of the encoded distribution of our neural network under the BM framework, it is theoretically identical to a traditional neural network.<br />
<br />
=== Distribution Matching ===<br />
<br />
We now need a way to fit our approximate distribution from our neural network <math>q_{\mathbf{z | x}}^{\mathbf{W}}</math> to our target distribution <math>p_{\mathbf{z|x},y}</math>. The authors achieve this by maximizing the evidence lower bound (ELBO):<br />
<br />
$$l_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) $$<br />
<br />
Each term can be computed analytically:<br />
<br />
$$\mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf W }} \left[\log z_y \right] = \psi(\alpha_y^{\mathbf W} ( \mathbf x )) - \psi(\alpha_0^{\mathbf W} ( \mathbf x )) $$<br />
<br />
Where <math>\psi(\cdot)</math> represents the digamma function (logarithmic derivative of gamma function). Intuitively, we maximize the probability of the correct label. For the KL term:<br />
<br />
$$KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) = \log \frac{\Gamma(a_0^{\mathbf W}(\mathbf x)) \prod_k \Gamma(\beta_k)}{\prod_k \Gamma(\alpha_k^{\mathbf W}(x)) \Gamma (\beta_0)} + \sum_k (\alpha_k^{\mathbf W}(x)-\beta_k)(\psi(\alpha_k^{\mathbf W}(\mathbf x)) - \psi(\alpha_0^{\mathbf W}(\mathbf x)) $$<br />
<br />
In the first term, for intuition, we can ignore <math>\alpha_0</math> and <math>\beta_0</math> since those just calibrate the distributions. Otherwise, we want the ratio of the products to be as close to 1 as possible to minimize the KL. In the second term, we want to minimize the difference between each individual <math>\alpha_k</math> and <math>\beta_k</math>, scaled by the normalized output of the neural network. <br />
<br />
This loss function can be used as a drop-in replacement for the standard softmax cross-entropy, as it has an analytic form and the same time complexity as typical softmax-cross entropy with respect to the number of classes (<math>O(K)</math>).<br />
<br />
=== On Prior Distributions ===<br />
<br />
We must choose our concentration parameter, <math>\beta</math>, for our dirichlet prior. We see our prior essentially disappears as <math>\beta_0 \to 0</math> and becomes stronger as <math>\beta_0 \to \infty</math>. Thus, we want a small <math>\beta_0</math> so the posterior isn't dominated by the prior. But, the authors claim that a small <math>\beta_0</math> makes <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small, which causes <math>\psi (\alpha_0^{\mathbf W}(\mathbf x))</math> to be large, which is problematic for gradient based optimization. In practice, many neural network techniques aim to make <math>\mathbb E [f^{\mathbf W} (\mathbf x)] \approx \mathbf 0</math> and thus <math>\mathbb E [\alpha^{\mathbf W} (\mathbf x)] \approx \mathbf 1</math>, which means making <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small can be counterproductive.<br />
<br />
So, the authors set <math>\beta = \mathbf 1</math> and introduce a new hyperparameter <math>\lambda</math> which is multiplied with the KL term in the ELBO:<br />
<br />
$$l^\lambda_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - \lambda KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; \mathcal P^D (\mathbf 1)) $$<br />
<br />
This stabilizes the optimization, as we can tell from the gradients:<br />
<br />
$$\frac{\partial l_{E B}\left(\mathbf{y}, \alpha^{\mathbf W}(\mathbf{x})\right)}{\partial \alpha_{k}^{\mathbf W}(\mathbf {x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\alpha_{k}^{\mathbf W}(\mathbf{x})-\beta_{k}\right)\right) \psi^{\prime}\left(\alpha_{k}^{\mathbf{W}}(\boldsymbol{x})\right)<br />
-\left(1-\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})-\beta_{0}\right)\right) \psi^{\prime}\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})\right)$$<br />
<br />
$$\frac{\partial l_{E B}^{\lambda}\left(\mathbf{y}, \alpha^{\mathbf{W}}(\mathbf{x})\right)}{\partial \alpha_{k}^{W}(\mathbf{x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})-\lambda\right)\right) \frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}<br />
-\left(1-\left(\tilde{\alpha}_{0}^{W}(\mathbf{x})-\lambda K\right)\right)$$<br />
<br />
As we can see, the first expression is affected by the magnitude of <math>\alpha^{\boldsymbol{W}}(\boldsymbol{x})</math>, whereas the second expression is not due to the <math>\frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}</math> ratio.<br />
<br />
== Experiments ==<br />
<br />
Throughout the experiments in this paper, the authors employ various models based on residual connections (He et al., 2016 [1]) which are the models used for benchmarking in practice. We will first demonstrate improvements provided by BM, then we will show versatility in other applications. For fairness of comparisons, all configurations in the reference implementation will be fixed. The only additions in the experiments are initial learning rate warm-up and gradient clipping which are extremely helpful for stable training of BM. <br />
<br />
=== Generalization performance === <br />
The paper compares the generalization performance of BM with softmax and MC dropout on CIFAR-10 and CIFAR-100 benchmarks.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T1.png]]<br />
<br />
The next comparison was performed between BM and softmax on the ImageNet benchmark. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T2.png]]<br />
<br />
For both datasets and in all configurations, BM achieves the best generalization, outperforming both softmax and MC dropout.<br />
<br />
===== Regularization effect of prior =====<br />
<br />
In theory, BM has two regularization effects:<br />
* The prior distribution, which smooths the target posterior.<br />
* Averaging over all possible categorical probabilities to compute the distribution-matching loss.<br />
The authors perform an ablation study to examine the two effects separately: removing the KL term in the ELBO removes the effect of the prior distribution.<br />
For ResNet-50 on CIFAR-100 and CIFAR-10, the resulting test error rates were 24.69% and 5.68%, respectively. <br />
<br />
This demonstrates that both regularization effects are significant since just having one of them improves the generalization performance compared to the softmax baseline, and having both improves the performance even more.<br />
<br />
===== Impact of <math>\beta</math> =====<br />
<br />
The effect of <math>\beta</math> on generalization performance is studied by training ResNet-18 on CIFAR-10, tuning the value of <math>\beta</math> on its own as well as jointly with <math>\lambda</math>. It was found that robust generalization performance is obtained for <math>\beta \in [e^{-1}, e^{4}]</math> when tuning <math>\beta</math> on its own, and <math>\beta \in [e^{-4}, e^{8}]</math> when tuning <math>\beta</math> jointly with <math>\lambda</math>. The figure below shows a plot of the error rate with varying <math>\beta</math>.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F3.png]]<br />
<br />
=== Uncertainty Representation ===<br />
<br />
One of the big advantages of BM is the ability to represent uncertainty about the prediction. The authors evaluate the uncertainty representation on in-distribution (ID) and out-of-distribution (OOD) samples. <br />
<br />
===== ID uncertainty =====<br />
<br />
For ID (in-distribution) samples, calibration performance is measured: how well the model’s confidence matches its actual accuracy. This can be visualized using reliability plots and quantified using a metric called the expected calibration error (ECE). ECE is calculated by grouping predictions into M bins based on their confidence score, finding the absolute difference between average accuracy and average confidence within each bin, and averaging these differences weighted by the fraction of samples in each bin.<br />
The figure below is a reliability plot of ResNet-50 on CIFAR-10 and CIFAR-100 with 15 groups. It shows that BM has a significantly better calibration performance than softmax since the confidence matches the accuracy more closely (this is also reflected in the lower ECE).<br />
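As a concrete illustration (a sketch of the standard ECE computation, not the paper's evaluation code), where <code>confidences</code> holds each prediction's top probability and <code>correct</code> flags whether that prediction was right:<br />

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Bin predictions by confidence, then average |accuracy - confidence|
    across bins, weighted by the fraction of samples falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            avg_acc = correct[in_bin].mean()
            avg_conf = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(avg_acc - avg_conf)
    return ece
```

A perfectly calibrated model (e.g. 80% confident and right 80% of the time) gives an ECE of 0.<br />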
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F4.png]]<br />
<br />
===== OOD uncertainty =====<br />
<br />
Here, the authors quantify uncertainty using predictive entropy - the larger the predictive entropy, the larger the uncertainty about a prediction. <br />
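Concretely, for a predictive distribution <math>p</math> over <math>K</math> classes, the predictive entropy is <math>H(p) = -\sum_k p_k \log p_k</math>, maximal (<math>\log K</math>) for a uniform prediction and near zero for a confident one. A small sketch:<br />

```python
import numpy as np

def predictive_entropy(probs):
    # H(p) = -sum_k p_k log p_k; clip to avoid log(0) for one-hot predictions
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())
```

An OOD sample should ideally receive a near-uniform prediction and hence an entropy close to <math>\log K</math>.<br />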
<br />
The figure below is a density plot of the predictive entropy of ResNet-50 on CIFAR-10. It shows that BM provides significantly better uncertainty estimation compared to other methods since BM is the only method that has a clear peak of high predictive entropy for OOD samples which should have high uncertainty. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F5.png]]<br />
<br />
=== Transfer learning ===<br />
<br />
Belief matching applies the Bayesian principle outside the neural network, which means it can easily be applied to already trained models. Thus, belief matching can be employed in transfer learning scenarios. The authors downloaded the ImageNet pre-trained ResNet-50 weights and fine-tuned the weights of the last linear layer for 100 epochs using an Adam optimizer.<br />
<br />
This table shows the test error rates from transfer learning on CIFAR-10, Food-101, and Cars datasets. Belief matching consistently performs better than softmax. <br />
<br />
[[File:being_bayesian_about_categorical_probability_transfer_learning.png]]<br />
<br />
Belief matching was also tested on predictive uncertainty for out-of-distribution samples, with CIFAR-10 as the in-distribution set. Looking at the figure below, belief matching significantly improves the uncertainty representation of pre-trained models by fine-tuning only the last layer’s weights. Note that belief matching confidently predicts examples in Cars, since CIFAR-10 contains the object category automobile. In comparison, softmax produces confident predictions on all datasets. Thus, belief matching can also enhance the uncertainty representation ability of pre-trained models without sacrificing their generalization performance.<br />
<br />
[[File: being_bayesian_about_categorical_probability_transfer_learning_uncertainty.png]]<br />
<br />
=== Semi-Supervised Learning ===<br />
<br />
Belief matching’s ability to let neural networks represent rich information in their predictions can be exploited to aid consistency-based loss functions for semi-supervised learning. Consistency-based loss functions use unlabelled samples to promote the robustness of predictions under stochastic perturbations. This can be done by perturbing the inputs (the VAT model) or the networks (the Pi-model). Both methods minimize the divergence between two categorical probabilities under some perturbation, so belief matching can be used by making the following replacements in the loss functions. The hope is that belief matching provides better prediction consistency through its Dirichlet distributions.<br />
<br />
[[File: being_bayesian_about_categorical_probability_semi_supervised_equation.png]]<br />
<br />
The results of training on ResNet28-2 with consistency based loss functions on CIFAR-10 are shown in this table. Belief matching does have lower classification error rates compared to using a softmax.<br />
<br />
[[File:being_bayesian_about_categorical_probability_semi_supervised_table.png]]<br />
<br />
== Conclusion ==<br />
<br />
Bayesian principles can be used to construct the target distribution by using the categorical probability as a random variable rather than a training label. This can be applied to neural network models by replacing only the softmax and cross-entropy loss, while improving the generalization performance and uncertainty estimation. <br />
<br />
In the future, the authors would like to allow for more expressive distributions in the belief matching framework, such as logistic normal distributions to capture strong semantic similarities among class labels. Furthermore, using input dependent priors would allow for interesting properties that would aid imbalanced datasets and multi-domain learning.<br />
<br />
== Citations ==<br />
<br />
[1] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Springer, 1990.<br />
<br />
[2] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.<br />
<br />
[3] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.<br />
<br />
[4] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. <br />
<br />
[5] MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448– 472, 1992.<br />
<br />
[6] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011. <br />
<br />
[7] Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1):4873–4907, 2017.<br />
<br />
[8] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference of Machine Learning, 2018.<br />
<br />
[9] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[10] Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[11] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.<br />
<br />
[12] Neumann, L., Zisserman, A., and Vedaldi, A. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2018.<br />
<br />
[13] Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. Disturblabel: Regularizing cnn on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.<br />
<br />
[14] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.</div>
<br />
Loss Function Search for Face Recognition
<hr />
<div>== Presented by ==<br />
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang<br />
<br />
== Introduction ==<br />
Face recognition is a technology that labels a face with a specific identity. The process involves two tasks: 1. identifying and classifying a face to a certain identity and 2. verifying whether this face and another face map to the same identity. Loss functions play an important role in evaluating how well a prediction models the given data. In face recognition, they are used to train convolutional neural networks (CNNs) with discriminative features. However, the traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed that learns a center for each identity to enhance intra-class compactness. Hence, the paper introduces a new loss function that reduces the softmax probability. <br />
<br />
Margin-based (angular, additive, and additive angular margin) softmax loss functions are important for learning discriminative features in face recognition. Previously developed hand-crafted methods, such as A-Softmax, V-Softmax, AM-Softmax, and Arc-Softmax, require much effort. Li et al. proposed an AutoML loss-function search method known as AM-LFS from a hyper-parameter optimization perspective [2]. It automatically determines the search space by leveraging reinforcement learning to search loss functions during training, though its drawback is a complex and unstable search space.<br />
<br />
'''Softmax'''<br />
The softmax probability is the probability assigned to each class: a vector of values between 0 and 1 that sum to 1. Cross-entropy loss is the negative log of these probabilities. When the softmax probability is combined with the cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:<br />
<br />
<center><math>L_1=-\log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k \neq y}^K{e^{w^T_kx}}}</math> [1] </center><br />
<br />
<br />
Specifically for face recognition, <math>L_1</math> is modified such that <math>w_y</math> and <math>x</math> are normalized and a scale factor <math>s</math> replaces the magnitude of <math>w^T_yx</math>:<br />
<br />
<center><math>L_2=-\log\frac{e^{s \cos{(\theta_{{w_y},x})}}}{e^{s \cos{(\theta_{{w_y},x})}} + \sum_{k \neq y}^K{e^{s \cos{(\theta_{{w_k},x})}}}}</math> [1] </center><br />
<br />
Where <math> \cos{(\theta_{{w_k},x})} = w^T_k x </math> is the cosine similarity and <math>\theta_{{w_k},x}</math> is the angle between <math> w_k</math> and <math>x</math>. The features learnt with this softmax loss are prone to be separable (as desired).<br />
<br />
'''Margin-based Softmax'''<br />
<br />
This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above.<br />
<br />
The margin-based softmax function is:<br />
<br />
<center><math>L_3=-\log\frac{e^{s f{(m,\theta_{{w_y},x})}}}{e^{s f{(m,\theta_{{w_y},x})}} + \sum_{k \neq y}^K{e^{s \cos{(\theta_{{w_k},x})}}}} </math> </center><br />
<br />
Here, <math>f{(m,\theta_{{w_y},x})} \leq \cos (\theta_{w_y,x})</math> is a carefully chosen margin function.<br />
<br />
Some other variations of chosen functions:<br />
<br />
'''A-Softmax Loss:''' <math>f{(m_1,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x})</math>, where <math>m_1 \geq 1</math> is an integer.<br />
<br />
'''Arc-Softmax Loss:''' <math>f{(m_2,\theta_{{w_y},x})} = \cos (\theta_{w_y,x} + m_2)</math>, where <math>m_2 > 0</math>.<br />
<br />
'''AM-Softmax Loss:''' <math>f{(m,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x} + m_2) - m_3</math>, where <math>m_1 \geq 1</math> is an integer and <math>m_2, m_3 > 0</math>.<br />
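The margin functions above differ only in how they penalize the true-class score. The sketch below (a hypothetical illustration, not the paper's implementation; the values of <math>s</math> and <math>m</math> are made up, not the paper's tuned settings) computes <math>L_3</math> with the additive cosine margin <math>f(m, \theta) = \cos\theta - m</math>:<br />

```python
import numpy as np

def margin_softmax_loss(W, x, y, s=30.0, m=0.35):
    """L3 with the additive cosine margin f(m, theta) = cos(theta) - m.

    W: (K, d) class weight matrix, x: (d,) feature, y: true class index.
    """
    W = W / np.linalg.norm(W, axis=1, keepdims=True)  # normalize each w_k
    x = x / np.linalg.norm(x)                         # normalize the feature
    cos = W @ x                                       # cos(theta_{w_k, x})
    logits = s * cos
    logits[y] = s * (cos[y] - m)      # margin penalizes the true-class score
    logits = logits - logits.max()                    # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[y])
```

Setting <math>m = 0</math> recovers the plain normalized softmax loss <math>L_2</math>; any positive margin lowers the true-class logit, so the loss can only increase, which is what pushes training toward more discriminative features.<br />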
<br />
<br />
<br />
In this paper, the authors first identify that reducing the softmax probability is a key contribution to feature discrimination and design two search spaces (a random method and a reward-guided method). They then evaluate their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms on nine popular face recognition benchmarks.<br />
<br />
== Motivation ==<br />
Previous algorithms for facial recognition frequently rely on CNNs trained with metric-learning loss functions such as contrastive loss or triplet loss. Without sensitive sample-mining strategies, the computational cost of these functions is high. This drawback prompted the redesign of the classical softmax loss, which cannot discriminate features well. Multiple softmax loss functions have since been developed, including margin-based formulations, which often require fine-tuning of parameters and are susceptible to instability. Therefore, researchers need to put much effort into creating their methods within the large design space. AM-LFS takes an optimization approach to selecting hyperparameters for margin-based softmax functions, but its aforementioned drawbacks are caused by a lack of direction in designing the search space.<br />
<br />
To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one required parameter and an improved search space using a reward-based method was determined by the authors to be the best option for their loss function.<br />
<br />
== Problem Formulation ==<br />
=== Analysis of Margin-based Softmax Loss ===<br />
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:<br />
<br />
<center><math>p_m=\frac{1}{ap+(1-a)} \cdot p</math></center><br />
<center> where <math>a=1-e^{s\left(\cos{(\theta_{{w_y},x})}-f{(m,\theta_{{w_y},x})}\right)}</math> and <math>a \leq 0</math></center><br />
<br />
<math>a</math> is considered a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the choice of margin function <math>f</math>, the modulated probability satisfies <math>p_m = h(a,p) \cdot p \leq p</math>, so the softmax probability is always reduced.<br />
<br />
Compared to AM-LFS, this method involves only one parameter (<math>a</math>), which is also constrained, whereas AM-LFS has 2M unconstrained parameters specifying the piecewise linear functions it requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because <math>p_m</math> could be larger than the softmax probability.<br />
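The claim that <math>h(a,p)</math> always shrinks the probability can be checked numerically; the snippet below (a quick illustrative check, not from the paper) verifies that <math>h(a,p) \in (0,1]</math> and <math>p_m \leq p</math> over a grid of the constrained region <math>a \leq 0</math>, <math>p \in (0,1]</math>:<br />

```python
import numpy as np

# Grid over the constrained region: a <= 0 down the rows, p in (0, 1] across
a = np.linspace(-5.0, 0.0, 11)[:, None]
p = np.linspace(0.01, 1.0, 100)[None, :]

h = 1.0 / (a * p + (1.0 - a))   # modulating function h(a, p)
p_m = h * p                     # modulated (margin-based) probability

# denominator = 1 + a*(p - 1) >= 1 whenever a <= 0 and p <= 1
assert np.all((h > 0.0) & (h <= 1.0))
assert np.all(p_m <= p)
```

At <math>a = 0</math> we get <math>h = 1</math> and the plain softmax probability is recovered; the more negative <math>a</math> is, the more strongly the probability is reduced.<br />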
<br />
=== Random Search ===<br />
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:<br />
<br />
<center><math>L_5=-\log{(h{(a,p)} \cdot p)}</math> where <math>h \in (0,1]</math> and <math>a \leq 0</math></center><br />
<br />
This encourages a feature margin between different classes and enables feature discrimination. The search space is thus defined as the choice of <math>h{(a,p)}</math>, whose impact on the training procedure is decided by the modulating factor <math>a</math>. To validate the unified formulation, the modulating factor is set randomly at each training epoch; this variant is denoted Random-Softmax in this paper.<br />
<br />
=== Reward-Guided Search ===<br />
Random search provides no guidance during training. To solve this, the authors use reinforcement learning. Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model: it does not need labelled input/output pairs, and sub-optimal actions do not need to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking actions that maximize a notion of cumulative reward [3]. The process of RL is shown in Figure 1. The equation of the cumulative reward function is: <br />
<br />
<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+\cdots+R_T</math></center><br />
<br />
where <math>G_t</math> = cumulative reward, <math>R_t</math> = immediate reward, and <math>R_T</math> = end of episode.<br />
<br />
<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate reward which depends on the agent action and the environment's reaction to this action.<br />
<br />
<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center><br />
<center>Figure 1: Reinforcement Learning scenario [4]</center><br />
<br />
The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5]. <br />
<br />
In this paper, RL is used to learn the sampling distribution of the hyperparameter <math>a</math> for the softmax equation, guided by the reward function. At each epoch, <math>B</math> hyper-parameters <math>\{a_1, a_2, ..., a_B\}</math> are sampled as <math>a \sim \mathcal{N}(\mu, \sigma)</math>, and <math>B</math> models are generated with rewards <math>R(a_i), i \in [1, B]</math>. <math>\mu</math> is updated after each epoch using the reward function. <br />
<br />
<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center><br />
<br />
Where <math>g(a_i; \mu, \sigma)</math> is the PDF of a Gaussian distribution. The distribution of <math>{a}</math> is updated, and the best model is selected from the <math>{B}</math> candidates for the next epoch.<br />
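The update above is a REINFORCE-style score-function estimate; for a Gaussian, <math>\nabla_\mu \log g(a; \mu, \sigma) = (a - \mu)/\sigma^2</math>. Below is a toy sketch of the update (the peaked reward function here is a made-up stand-in for validation accuracy, not the paper's setup):<br />

```python
import numpy as np

def update_mu(mu, sigma, reward_fn, rng, B=8, eta=0.05):
    """One epoch of the reward-guided search: sample B candidate
    hyper-parameters a_i ~ N(mu, sigma), score each one, and move mu
    along the average of R(a_i) * grad_mu log g(a_i; mu, sigma)."""
    a = rng.normal(mu, sigma, size=B)            # candidate modulating factors
    R = np.array([reward_fn(ai) for ai in a])    # reward, e.g. val. accuracy
    grad_log = (a - mu) / sigma**2               # score function for a Gaussian
    return mu + eta * np.mean(R * grad_log)

# Toy reward peaked at a = -2; mu should drift toward the peak.
reward = lambda a: np.exp(-(a + 2.0) ** 2)
rng = np.random.default_rng(0)
mu = 0.0
for _ in range(300):
    mu = update_mu(mu, 1.0, reward, rng)
```

Running the loop drifts <math>\mu</math> toward the hyper-parameter value with the highest (smoothed) reward.<br />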
<br />
=== Optimization ===<br />
Calculating the reward involves a standard bi-level optimization problem. A standard bi-level optimization problem is a hierarchy of two optimization tasks, an upper-level or leader and lower-level or follower problems, which involves a hyperparameter ({<math>a_1,a_2,…,a_B</math>}) that can be used for minimizing one objective function while maximizing another objective function simultaneously:<br />
<br />
<center><math>\max_a R(a)=r(M_{w^*(a)},S_v)</math></center><br />
<center><math>w^*(a)=\arg\min_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center><br />
<br />
In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The calculated reward for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) yields the corresponding score, then the algorithm chooses the one with the highest score for model index selection. With the model containing the highest score being used in the next epoch, this process is repeated until the training reaches convergence. In the end, the algorithm takes the model with the highest score without retraining.<br />
<br />
== Results and Discussion ==<br />
=== Data Preprocessing ===<br />
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets.<br />
Furthermore, it is important to perform open-set evaluation for the face recognition problem; that is, there should be no overlapping identities between the training and testing sets. As a result, a total of 15,414 identities were removed from the testing sets. For fairness of comparison, all summarized results are based on the refined datasets.<br />
<br />
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===<br />
For LFW, there is not a noticeable difference between the algorithms proposed in this paper and the other algorithms; however, AM-Softmax achieved higher results than Search-Softmax, and Random-Softmax achieved the highest results, by 0.03%.<br />
<br />
Random-Softmax outperforms the baseline softmax and is comparable to most of the margin-based softmax losses. Search-Softmax boosts performance further and beats most methods; specifically, when training on the CASIA-WebFace-R data set, it achieves a 0.72% average improvement over AM-Softmax. The proposed model gives better results because its optimization strategy helps boost discrimination power, and because the candidates sampled from the proposed search space can well approximate the margin-based loss functions. More tests on more complicated protocols are needed to probe performance further. Little improvement is shown on these test sets, since they are relatively simple and the performance of all methods on them is near saturation. The following table summarizes the performance of each model.<br />
<br />
<center>Table 1.Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<br />
<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on RFW ===<br />
The RFW dataset measures racial bias which consists of Caucasian, Indian, Asian, and African. Using this as the test set, Random-softmax and Search-softmax performed better than the other methods. Random-softmax outperforms the baseline softmax by a large margin which means reducing the softmax probability will enhance the feature discrimination for face recognition. It is also observed that the reward guided search-softmax method is more likely to enhance the discriminative feature learning resulting in higher performance as shown in Table 2 and Table 3. <br />
<br />
<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on MegaFace and Trillion-Pairs ===<br />
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 rate and the verification (Veri.) true positive rate (TPR) at a low false acceptance rate (FAR) of <math>10^{-3}</math> on MegaFace, along with the identification TPR@FAR = <math>10^{-6}</math> and the verification TPR@FAR = <math>10^{-9}</math> on Trillion-Pairs, are reported in Tables 4 and 5.<br />
<br />
On the test sets MegaFace and Trillion-Pairs, Search-Softmax achieves the best performance over all alternative methods. On MegaFace, Search-Softmax beat the best competitor, AM-Softmax, by a large margin. It also outperformed AM-LFS due to the newly designed search space. <br />
<br />
<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center><br />
<br />
From the CMC curves and ROC curves in Figure 2, similar trends are observed for the other measures. The same trend holds on Trillion-Pairs, where Search-Softmax loss is superior, with 4% improvements using CASIA-WebFace-R and 1% improvements using MS-Celeb-1M-v1c-R for both identification and verification. Based on these experiments, Search-Softmax loss performs well, especially at low false positive rates, and shows strong generalization ability for face recognition.<br />
<br />
<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center><br />
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center><br />
<br />
== Conclusion ==<br />
In this paper, it is discussed that in order to enhance feature discrimination for face recognition, it is key to know how to reduce the softmax probability. To achieve this goal, unified formulation for the margin-based softmax losses is designed. Two search methods have been developed using a random and a reward-guided loss function and they were validated to be effective over six other methods using nine different test data sets. While these developed methods were generally more effective in increasing accuracy versus previous methods, there is very little difference between the two. It can be seen that Search-Softmax performs slightly better than Random-Softmax most of the time.<br />
<br />
== Critiques ==<br />
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.<br />
* Datasets used did require some preprocessing, which may have improved the results beyond what the method otherwise would.<br />
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.<br />
* The test data sets used to evaluate Search-Softmax and Random-Softmax are simple, and other methods saturate on them, so the results show few advantages: the methods produce very similar numbers. A more complicated data set needs to be tested to prove the method's reliability.<br />
* There is another paper Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf] that provides a more detailed explanation about how to reduce margin-based softmax loss.<br />
* It is questionable when it comes to the accuracy of testing sets, as they only used the clean version of CASIA-WebFace and MS-Celeb-1M-vlc for training instead of these two training sets with noisy labels.<br />
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], written by Tae-Hyun Oh et al., they also discuss an optimal loss function for face recognition. However, since in the other paper, they were doing face recognition from voice audio, the loss function used was slightly different than the ones discussed in this paper.<br />
* This model has many applications, such as identifying disguised prisoners for police. However, good data preprocessing is needed, or the predicted results may suffer; the authors did not elaborate on data preprocessing, which is a key part of this model.<br />
* It would be better to know what kinds of noise were removed in the clean version. Also, simply removing the overlapping identities is wasteful; it would be better to put them into either the training or the testing samples.<br />
* This paper indicates that the new search method and loss function yield more effective face recognition than the other six methods, but there is no mention of the change in computational efficiency; since only very little difference exists between the methods, and real-time evaluation is often required in face recognition applications, efficiency matters.<br />
* Some loss functions receive more than two inputs. For example, the ''triplet loss'' function, developed by Google, takes three inputs: a positive input, a negative input, and an anchor input. This makes sense because, for face recognition, we want the model to learn not only what it is supposed to predict but also what it is not supposed to predict. Typically, triplet loss handles false positives much better. This paper could extend its scope to such loss functions that take more than two inputs.<br />
* It would be good to also know what the training time is like for the method, specifically the "Reward-Guided Search" which uses RL. Also the authors mention some data preprocessing that was performed, was this same preprocessing also performed for the methods they compared against?<br />
<br />
== References ==<br />
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.<br />
<br />
[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.<br />
<br />
[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020]. <br />
<br />
[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].<br />
<br />
[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].</div>
<hr />
<div>== Presented by ==<br />
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang<br />
<br />
== Introduction ==<br />
Face recognition is a technology that can label a face to a specific identity. The process involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face and another face map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness. Hence, the paper introduced a new loss function which can reduce the softmax probability. <br />
<br />
Margin-based (angular, additive, and additive angular margin) softmax loss functions are important for learning discriminative features in face recognition. Several hand-crafted methods requiring much effort have been developed previously, such as A-Softmax, V-Softmax, AM-Softmax, and Arc-Softmax. Li et al. proposed an AutoML method for loss function search, known as AM-LFS, from a hyper-parameter optimization perspective [2]. It automatically determines the search space and leverages reinforcement learning to search for loss functions during the training process, though its drawback is a complex and unstable search space.<br />
<br />
'''Softmax'''<br />
The softmax probability is the predicted probability for each class: a vector of values between 0 and 1 that add up to 1. Cross-entropy loss is the negative log of the predicted probability of the true class. Combining the softmax probability with cross-entropy loss in the last fully connected layer of the CNN yields the softmax loss function:<br />
<br />
<center><math>L_1=-\log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k \neq y}^K{e^{w^T_kx}}}</math> [1] </center><br />
<br />
<br />
Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:<br />
<br />
<center><math>L_2=-\log\frac{e^{s \cos{(\theta_{{w_y},x})}}}{e^{s \cos{(\theta_{{w_y},x})}} + \sum_{k \neq y}^K{e^{s \cos{(\theta_{{w_k},x})}}}}</math> [1] </center><br />
<br />
Where <math> \cos{(\theta_{{w_k},x})} = w^T_kx </math> is the cosine similarity (for normalized <math>w_k</math> and <math>x</math>) and <math>\theta_{{w_k},x}</math> is the angle between <math> w_k</math> and <math>x</math>. The features learnt with this softmax loss are prone to be separable (as desired).<br />
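A minimal NumPy sketch of this normalized softmax loss L_2; the scale s and the toy weight matrix and feature vector below are illustrative:

```python
import numpy as np

def cosine_softmax_loss(w, x, y, s=30.0):
    """Softmax loss on scaled cosine similarities (L_2 in the text).

    w : (K, d) class weight vectors, x : (d,) feature, y : true class index.
    """
    w_n = w / np.linalg.norm(w, axis=1, keepdims=True)  # normalize each w_k
    x_n = x / np.linalg.norm(x)                         # normalize the feature
    cos = w_n @ x_n                                     # cos(theta_{w_k,x}) for all k
    logits = s * cos
    logits = logits - logits.max()                      # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()           # softmax probabilities
    return -np.log(p[y]), p

# Toy example: 3 classes with orthonormal weight vectors, feature closest to class 0.
loss, p = cosine_softmax_loss(np.eye(3), np.array([0.9, 0.1, 0.0]), y=0)
```

The loss is small when the feature's cosine similarity with the true class weight dominates the others.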
<br />
'''Margin-based Softmax'''<br />
<br />
This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above.<br />
<br />
The margin-based softmax function is:<br />
<br />
<center><math>L_3=-\log\frac{e^{s f{(m,\theta_{{w_y},x})}}}{e^{s f{(m,\theta_{{w_y},x})}} + \sum_{k \neq y}^K{e^{s \cos{(\theta_{{w_k},x})}}}} </math> </center><br />
<br />
Here, <math>f{(m,\theta_{{w_y},x})} \leq \cos (\theta_{w_y,x})</math> is a carefully chosen margin function.<br />
<br />
Some other variations of softmax (including A-Softmax loss, Arc-Soft max loss, and AM-Softmax loss) will be discussed in detail in the later sections. <br />
<br />
In this paper, the authors first identify that reducing the softmax probability is the key to feature discrimination and design two search spaces (a random method and a reward-guided method). They then evaluate their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms on nine popular face recognition benchmarks.<br />
<br />
<br />
== Motivation ==<br />
Previous algorithms for facial recognition frequently rely on CNNs trained with metric learning loss functions such as contrastive loss or triplet loss. Without carefully designed sample mining strategies, the computational cost of these functions is high. This drawback prompted the redesign of the classical softmax loss, which cannot discriminate features on its own. Multiple softmax variants, including margin-based formulations, have since been developed, but they often require fine-tuning of parameters and are susceptible to instability, so researchers must put a lot of effort into designing their method within a large design space. AM-LFS takes an optimization approach to selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.<br />
<br />
To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one required parameter and an improved search space using a reward-based method was determined by the authors to be the best option for their loss function.<br />
<br />
== Problem Formulation ==<br />
=== Analysis of Margin-based Softmax Loss ===<br />
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:<br />
<br />
<center><math>p_m=\frac{1}{ap+(1-a)} \cdot p</math></center><br />
<center> where <math>a=1-e^{s\left(\cos{(\theta_{w_y,x})}-f{(m,\theta_{w_y,x})}\right)}</math> and <math>a \leq 0</math></center><br />
<br />
<math>a</math> is considered a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the specific margin function <math>f</math>, a margin-based softmax loss works by reducing the softmax probability; this reduction is the key to feature discrimination.<br />
<br />
Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, versus AM-LFS which has 2M parameters without constraints that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because it could be larger than the softmax probability.<br />
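These quantities are easy to check numerically. A minimal sketch, assuming an additive-margin-style f(m, theta) = cos(theta) - m; the specific s, m, p and cos(theta) values below are illustrative:

```python
import numpy as np

def margin_modulation(p, s, cos_theta, f_val):
    """Compute a, h(a, p) and p_m = h(a, p) * p from the text.

    f_val = f(m, theta) must satisfy f_val <= cos_theta, which forces a <= 0.
    """
    a = 1.0 - np.exp(s * (cos_theta - f_val))  # modulating factor, a <= 0
    h = 1.0 / (a * p + (1.0 - a))              # modulating function, in (0, 1]
    return a, h, h * p

# Example: cos(theta) = 0.8, additive margin m = 0.35 so f = 0.8 - 0.35 = 0.45.
a, h, p_m = margin_modulation(p=0.9, s=30.0, cos_theta=0.8, f_val=0.45)
```

The check confirms that the margin-based probability p_m is strictly smaller than the plain softmax probability p whenever the margin is active.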
<br />
=== Random Search ===<br />
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:<br />
<br />
<center><math>L_5=-\log{(h{(a,p)} \cdot p)}</math> where <math>h{(a,p)} \in (0,1]</math> and <math>a \leq 0</math></center><br />
<br />
This encourages a feature margin between different classes and provides the capability of feature discrimination. This leads to defining the search space as the choice of <math>h{(a,p)}</math>, whose impact on the training procedure is decided by the modulating factor <math>a</math>. To validate the unified formulation, a modulating factor is randomly set at each training epoch; this is denoted Random-Softmax in the paper.<br />
<br />
=== Reward-Guided Search ===<br />
Random search provides no guidance for training. To address this, the authors use reinforcement learning. Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model: it does not need labelled input/output pairs, and it does not need sub-optimal actions to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking actions that maximize a notion of cumulative reward [3]. The process of RL is shown in Figure 1. The equation of the cumulative reward function is: <br />
<br />
<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+\cdots+R_T</math></center><br />
<br />
where <math>G_t</math> is the cumulative reward, <math>R_t</math> is the immediate reward at time <math>t</math>, and <math>T</math> is the end of the episode.<br />
<br />
<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate rewards, which depend on the agent's actions and the environment's reactions to those actions.<br />
<br />
<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center><br />
<center>Figure 1: Reinforcement Learning scenario [4]</center><br />
<br />
The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5]. <br />
<br />
In this paper, RL is used to learn a distribution over the modulating factor <math>a</math> of the softmax formulation, parameterized by its mean <math>\mu</math>, using the reward function. At each epoch, <math>B</math> hyper-parameters <math>{a_1, a_2, ..., a_B }</math> are sampled as <math>a \sim \mathcal{N}(\mu, \sigma)</math>, and <math>B</math> models are trained, yielding rewards <math>R(a_i), i \in [1, B]</math>. <math>\mu</math> is updated after each epoch using the reward function. <br />
<br />
<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center><br />
<br />
Where <math>{g(a_i; \mu, \sigma})</math> is the PDF of a Gaussian distribution. The distribution of <math>{a}</math> is updated, and the best model is found from the <math>{B}</math> candidates for the next epoch.<br />
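The update of mu is a REINFORCE-style gradient step on the Gaussian sampling distribution, taking the score of log g with respect to mu. A minimal sketch with a stand-in reward; the paper's actual reward is validation-set performance, and the quadratic reward, seed, and constants below are illustrative:

```python
import numpy as np

def update_mu(mu, sigma, rewards, samples, eta=0.05):
    """One reward-guided update of mu: move toward samples with high reward.

    For a Gaussian g(a; mu, sigma), the gradient of log g with respect to mu
    is (a - mu) / sigma**2, which is the score used in the update rule.
    """
    grads = (samples - mu) / sigma**2
    return mu + eta * np.mean(rewards * grads)

# Toy run: suppose the best modulating factor under a stand-in reward is a = -2.
rng = np.random.default_rng(0)
mu, sigma = -1.0, 0.5
for _ in range(200):
    samples = rng.normal(mu, sigma, size=8)   # B = 8 candidate factors per epoch
    rewards = -(samples + 2.0) ** 2           # placeholder for r(M, S_v)
    mu = update_mu(mu, sigma, rewards, samples)
```

After a few hundred epochs, mu drifts toward the factor that earns the highest reward.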
<br />
=== Optimization ===<br />
Calculating the reward involves a standard bi-level optimization problem: a hierarchy of two optimization tasks, an upper-level (leader) problem and a lower-level (follower) problem. Here the upper level maximizes the reward over the hyper-parameters ({<math>a_1,a_2,\ldots,a_B</math>}) while the lower level minimizes the training loss over the model weights:<br />
<br />
<center><math>\max_a R(a)=r(M_{w^*(a)},S_v)</math></center><br />
<center><math>w^*(a)=\arg\min_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center><br />
<br />
In this case, the loss function is computed on the training set <math>S_t</math> and the reward function on the validation set <math>S_v</math>. The weights <math>w</math> are trained to minimize the loss function, while the hyper-parameter is chosen to maximize the reward function. The calculated reward for each model (<math>M_{w_e^1},M_{w_e^2},\ldots,M_{w_e^B}</math>) yields a corresponding score, and the algorithm selects the model index with the highest score. With the highest-scoring model carried into the next epoch, this process is repeated until training converges. In the end, the algorithm takes the model with the highest score without retraining.<br />
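One search epoch of this scheme can be sketched as follows; `train_one_epoch` and `validation_reward` are hypothetical stand-ins for the inner minimization of L^a on S_t and the outer reward r on S_v:

```python
def search_epoch(model_state, candidates, train_one_epoch, validation_reward):
    """Train one candidate model per sampled factor a, keep the best by reward."""
    scored = []
    for a in candidates:                        # one model per sampled factor
        w = train_one_epoch(model_state, a)     # minimize L^a over the training set
        scored.append((validation_reward(w), w))
    best_reward, best_w = max(scored, key=lambda t: t[0])
    return best_reward, best_w

# Toy check with scalar "weights": pretend training returns w = a,
# and the validation reward prefers w near 0.
r, w = search_epoch(
    model_state=5.0,
    candidates=[-3.0, -1.0, -0.2],
    train_one_epoch=lambda w0, a: a,
    validation_reward=lambda w: -abs(w),
)
```

The candidate with the highest validation reward is the one carried into the next epoch.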
<br />
== Results and Discussion ==<br />
=== Data Preprocessing ===<br />
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets.<br />
Furthermore, it is important to perform open-set evaluation for the face recognition problem: there should be no overlapping identities between the training and testing sets. As a result, a total of 15,414 identities were removed from the testing sets. For fairness of comparison, all summarized results are based on the refined datasets.<br />
<br />
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===<br />
For LFW, there is no noticeable difference between the algorithms proposed in this paper and the other algorithms; AM-Softmax achieved higher results than Search-Softmax, while Random-Softmax achieved the highest result, by 0.03%.<br />
<br />
Random-Softmax outperforms the baseline softmax and is comparable to most of the margin-based softmax losses. Search-Softmax boosts performance further and beats most methods; in particular, when trained on the CASIA-WebFace-R data set, it achieves a 0.72% average improvement over AM-Softmax. The models proposed in the paper give better results because their optimization strategy helps boost the discrimination power, and the candidates sampled from the proposed search space can well approximate the margin-based loss functions. More tests on more complicated protocols are needed to assess the performance further. Not much improvement is shown on these test sets, since they are relatively simple and the performance of all methods on them is near saturation. The following table summarizes the performance of each model.<br />
<br />
<center>Table 1.Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<br />
<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on RFW ===<br />
The RFW dataset measures racial bias and consists of Caucasian, Indian, Asian, and African subsets. Using this as the test set, Random-softmax and Search-softmax performed better than the other methods. Random-softmax outperforms the baseline softmax by a large margin, which indicates that reducing the softmax probability enhances feature discrimination for face recognition. It is also observed that the reward-guided Search-softmax method is more likely to enhance discriminative feature learning, resulting in higher performance, as shown in Table 2 and Table 3. <br />
<br />
<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on MegaFace and Trillion-Pairs ===<br />
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 rate and the verification (Veri.) true positive rate (TPR) at a low false acceptance rate (FAR) of <math>1e-3</math> on MegaFace, and the identification TPR@FAR = <math>1e-6</math> and verification TPR@FAR = <math>1e-9</math> on Trillion-Pairs, are reported in Tables 4 and 5.<br />
<br />
On the MegaFace and Trillion-Pairs test sets, Search-softmax achieves the best performance over all alternative methods. On MegaFace, Search-softmax beats its best competitor, AM-softmax, by a large margin. It also outperforms AM-LFS thanks to the newly designed search space. <br />
<br />
<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center><br />
<br />
From the CMC curves and ROC curves in Figure 2, similar trends are observed for the other measures. The same trend holds on Trillion-Pairs, where Search-softmax loss is superior, with 4% improvement when training on CASIA-WebFace-R and 1% improvement when training on MS-Celeb-1M-v1c-R, for both identification and verification. Based on these experiments, Search-Softmax loss performs well, especially at a low false positive rate, and it shows a strong generalization ability for face recognition.<br />
<br />
<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center><br />
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center><br />
<br />
== Conclusion ==<br />
In this paper, it is argued that the key to enhancing feature discrimination for face recognition is knowing how to reduce the softmax probability. To achieve this goal, a unified formulation for the margin-based softmax losses is designed, and two search methods are developed, one using a random and one using a reward-guided strategy. They were validated to be effective against six other methods on nine different test data sets. While the developed methods were generally more effective at increasing accuracy than previous methods, there is very little difference between the two: Search-Softmax performs slightly better than Random-Softmax most of the time.<br />
<br />
== Critiques ==<br />
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.<br />
* Datasets used did require some preprocessing, which may have improved the results beyond what the method would otherwise achieve.<br />
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.<br />
* The test data sets used to evaluate Search-Softmax and Random-Softmax are simple, and other methods already saturate on them, so the results of the proposed methods did not show many advantages, producing very similar numbers. A more complicated data set needs to be tested to prove the method's reliability.<br />
* Another paper, Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf], provides a more detailed explanation of margin-based softmax losses and how they reduce the softmax probability.<br />
* It is questionable when it comes to the accuracy of testing sets, as they only used the clean version of CASIA-WebFace and MS-Celeb-1M-vlc for training instead of these two training sets with noisy labels.<br />
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], written by Tae-Hyun Oh et al., they also discuss an optimal loss function for face recognition. However, since in the other paper, they were doing face recognition from voice audio, the loss function used was slightly different than the ones discussed in this paper.<br />
* This model has many applications, such as identifying disguised prisoners for police. However, good data preprocessing is needed, otherwise the predictions may suffer, and the authors did not discuss data preprocessing, which is a key part of this model.<br />
* It would be better to know what kinds of noise were removed in the clean version. Also, simply removing the overlapping identities is wasteful; it would be better to assign them to either the training or the test set.<br />
* This paper indicates that the new search method and loss functions induce more effective face recognition results than the other six methods, but there is no mention of the increase or decrease in computational efficiency. Since only very small differences exist between the methods, and real-time evaluation is often required in face recognition applications, efficiency matters.<br />
* There are some loss functions that receive more than two inputs. For example, the ''triplet loss'' function, developed by Google, takes three inputs: a positive input, a negative input, and an anchor input. This makes sense because for face recognition we want the model to learn not only what it is supposed to predict but also what it is not supposed to predict. Typically, triplet loss handles false positives much better. This paper could extend its scope to such loss functions.<br />
* It would be good to also know what the training time is like for the method, specifically the "Reward-Guided Search", which uses RL. Also, the authors mention some data preprocessing that was performed; was the same preprocessing also performed for the methods they compared against?<br />
<br />
== References ==<br />
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.<br />
<br />
[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.<br />
<br />
[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020]. <br />
<br />
[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].<br />
<br />
[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectors&diff=49044Improving neural networks by preventing co-adaption of feature detectors2020-12-03T23:02:30Z<p>Z42qin: /* Improvement Intro */</p>
<hr />
<div>== Presented by ==<br />
Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim<br />
<br />
= Improvement Intro =<br />
'''Drop Out Model'''<br />
<br />
In this paper, Hinton et al. introduce a novel way to improve neural networks' performance, particularly in the case where a large feedforward neural network is trained on a small training set, which causes poor performance on held-out data, i.e., overfitting. This problem can be reduced by randomly omitting half of the feature detectors on each training case. By omitting neurons in hidden layers with a probability of 0.5, each hidden unit is prevented from relying on other hidden units being present during training, so there are fewer co-adaptations among them on the training data. Called "dropout," this process is also an efficient alternative to training many separate networks and averaging their predictions on the test set. <br />
<br />
The intuition for dropout is that if neurons are randomly dropped during training, they can no longer rely on their neighbours, thus allowing each neuron to become more robust. Another interpretation is that dropout is similar to training an ensemble of models, since each epoch with randomly dropped neurons can be viewed as training its own model. <br />
<br />
They used the standard stochastic gradient descent algorithm and separated the training data into mini-batches. An upper bound was set on the L2 norm of the incoming weight vector for each hidden neuron, which was renormalized whenever its size exceeded the bound. They found that using a constraint, instead of a penalty, forced the model to do a more thorough search of the weight-space when coupled with the very large learning rate that decays during training.<br />
<br />
'''Mean Network'''<br />
<br />
Their dropout models included all of the hidden neurons at test time, with their outgoing weights halved to account for the chance of omission during training. This is called a "mean network". It is similar to taking the geometric mean of the probability distributions predicted by all <math>2^N</math> dropped-out networks. The correct answers tend to have a higher log probability under the mean network than under an individual dropout network, which also leads to a lower squared error of the network. <br />
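A minimal NumPy sketch of the two regimes, assuming ReLU hidden units. The paper halves the outgoing weights of the mean network; scaling the layer's activations by 0.5 at test time, as done here, is equivalent for the layer shown:

```python
import numpy as np

rng = np.random.default_rng(42)

def hidden_forward(x, W, train=True, p_drop=0.5):
    """One hidden layer with dropout on its outputs.

    Training: each hidden unit is kept with probability 1 - p_drop.
    Testing: all units are kept and the activity is scaled by (1 - p_drop),
    approximating the geometric mean over all 2^N dropped-out networks.
    """
    h = np.maximum(0.0, x @ W)                # ReLU hidden activations
    if train:
        mask = rng.random(h.shape) >= p_drop  # drop roughly half the units
        return h * mask
    return h * (1.0 - p_drop)                 # mean network: halve the activity

x = np.ones(4)
W = np.full((4, 3), 0.5)
h_test = hidden_forward(x, W, train=False)    # deterministic mean-network output
```

At test time every unit contributes, but at half strength, matching its expected participation during training.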
<br />
<br />
The models were shown to result in lower test error rates on several datasets: MNIST; TIMIT; Reuters Corpus Volume; CIFAR-10; and ImageNet.<br />
<br />
= MNIST =<br />
The MNIST dataset contains 70,000 digit images of size 28 x 28. To see the impact of dropout, they used 4 different neural networks (784-800-800-10, 784-1200-1200-10, 784-2000-2000-10, 784-1200-1200-1200-10), using the same dropout rates: 50% for hidden neurons and 20% for visible neurons. Stochastic gradient descent was used with mini-batches of size 100 and a cross-entropy objective as the loss function. Weights were updated after each minibatch, and training was done for 3000 epochs. An exponentially decaying learning rate <math>\epsilon</math> was used, with the initial value set to 10.0 and multiplied by the decay factor <math>f</math> = 0.998 at the end of each epoch. At each hidden layer, the incoming weight vector of each hidden neuron was constrained to a maximum length <math>l</math>, and cross-validation showed that results were best with <math>l</math> = 15. Initial weight values were drawn from a normal distribution with mean 0 and standard deviation 0.01. To update weights, an additional variable <math>p</math>, called momentum, was used to accelerate learning. The initial value of <math>p</math> was 0.5, increased linearly to the final value 0.99 during the first 500 epochs, and remained unchanged afterwards. Also, when updating weights, the learning rate was multiplied by <math>1-p</math>. <math>L</math> denotes the gradient of the loss function.<br />
<br />
[[File:weights_mnist2.png|center|400px]]<br />
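The training schedule just described can be sketched as follows. This is a simplified sketch: the (1 - p) scaling of the learning rate follows the text, while the column-per-hidden-unit layout of the weight matrix is an assumption made for illustration:

```python
import numpy as np

def sgd_step(W, v, grad, lr, p, max_norm=15.0):
    """One minibatch update: momentum p, learning rate scaled by (1 - p),
    then a max-norm constraint on each hidden unit's incoming weight vector
    (here, a column of W)."""
    v = p * v - (1.0 - p) * lr * grad
    W = W + v
    norms = np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-12)
    W = W * np.minimum(1.0, max_norm / norms)   # renormalize only if too long
    return W, v

# Schedules from the text: lr starts at 10.0 and decays by a factor of 0.998
# per epoch; momentum ramps linearly from 0.5 to 0.99 over the first 500 epochs.
def lr_at(epoch, eps0=10.0, f=0.998):
    return eps0 * f**epoch

def momentum_at(epoch, p0=0.5, p1=0.99, ramp=500):
    return p1 if epoch >= ramp else p0 + (p1 - p0) * epoch / ramp
```

The max-norm constraint lets the large, decaying learning rate explore the weight space without any column of W ever exceeding length 15.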
<br />
The best published result for a standard feedforward neural network was 160 errors. This was reduced to about 130 errors with 0.5 dropout and different L2 constraints for each hidden unit input weight. By omitting a random 20% of the input pixels in addition to the aforementioned changes, the number of errors was further reduced to 110. The following figure visualizes the result.<br />
[[File:mnist_figure.png|center|500px]]<br />
A publicly available pre-trained deep belief net resulted in 118 errors, and it was reduced to 92 errors when the model was fine-tuned with dropout. Another publicly available model was a deep Boltzmann machine, and it resulted in 103, 97, 94, 93 and 88 when the model was fine-tuned using standard backpropagation and was unrolled. They were reduced to 83, 79, 78, 78, and 77 when the model was fine-tuned with dropout – the mean of 79 errors was a record for models that do not use prior knowledge or enhanced training sets.<br />
<br />
= TIMIT = <br />
<br />
TIMIT dataset includes voice samples of 630 American English speakers varying across 8 different dialects. It is often used to evaluate the performance of automatic speech recognition systems. Using Kaldi, the dataset was pre-processed to extract input features in the form of log filter bank responses.<br />
<br />
=== Pre-training and Training ===<br />
<br />
For pretraining, they used a deep belief network whose first layer was a Gaussian Restricted Boltzmann Machine (RBM). Initializing visible biases to zero, weights were sampled from a normal distribution <math>N(0, 0.01)</math>. Each visible neuron's variance was set to 1.0 and remained unchanged.<br />
<br />
Contrastive Divergence (CD) was minimized to facilitate learning. Since momentum is used to speed up learning, it was initially set to 0.5 and increased linearly to 0.9 over 20 epochs. A learning rate of 0.001 was applied to the average gradient and then multiplied by <math>(1-momentum)</math>, and the L2 weight decay was set to 0.001. With these hyperparameters, training of this layer finished after 100 epochs. Binary RBMs were used for training all subsequent layers with a learning rate of 0.01. Then, <math>p</math> was set to the mean activation of a neuron in the data set, and the visible bias of each neuron was initialized to <math>\log(p/(1-p))</math>. Each of these layers was trained for 50 epochs, with all remaining hyper-parameters the same as those for the Gaussian RBM.<br />
<br />
=== Dropout tuning ===<br />
<br />
The network weights were initialized from the pretrained RBMs. To finetune the network with dropout backpropagation, momentum was initially set to 0.5 and increased linearly to 0.9 over 10 epochs. A small constant learning rate of 1.0 was applied to the average gradient on each minibatch. All other hyperparameters were kept the same as in the MNIST dropout finetuning. The model required approximately 200 epochs to converge. For comparison purposes, they also finetuned the same network with standard backpropagation, using a learning rate of 0.1 and the same hyperparameters.<br />
<br />
=== Classification Test and Performance ===<br />
<br />
A neural network was constructed to measure the classification error rate on the TIMIT test set. The network has four fully-connected hidden layers with 4000 neurons per layer, and the output layer distinguishes classes using 185 softmax output neurons that are merged into 39 classes. After constructing the network, 21 adjacent frames, with an advance of 10 ms per frame, were given as input.<br />
<br />
Comparing the performance of dropout with standard backpropagation on several network architectures and input representations, dropout consistently achieved lower error and cross-entropy. Results showed that it significantly controls overfitting, making the method robust to choices of network architecture. It also allowed much larger nets to be trained and removed the need for early stopping. Thus, neural network architectures with dropout are not very sensitive to the choice of learning rate and momentum.<br />
<br />
= Reuters Corpus Volume =<br />
Reuters Corpus Volume I archives 804,414 news documents belonging to 103 topics. Under four major themes (corporate/industrial, economics, government/social, and markets), they belonged to 63 classes. After removing 11 classes with no data and one class with insufficient data, 50 classes and 402,738 documents remained. The documents were divided into training and test sets equally and randomly, with each document represented by the 2000 most frequent words in the dataset, excluding stopwords.<br />
<br />
They trained two neural networks, with size 2000-2000-1000-50, one using dropout and backpropagation, and the other using standard backpropagation. The training hyperparameters are the same as that in MNIST, but training was done for 500 epochs.<br />
<br />
In the following figure, we see the significant improvements in test set error achieved by the model with dropout. On the right side, we see that learning with dropout also proceeds more smoothly. <br />
<br />
[[File:reuters_figure.png|700px|center]]<br />
<br />
= CNN =<br />
<br />
Feed-forward neural networks consist of several layers of neurons, where each neuron in a layer applies a linear filter to the input image data and passes the result on to the neurons in the next layer. When calculating a neuron's output, a scalar bias is added to the filter response and a nonlinear activation function is applied; the filter weights and biases are parameters of the network that are learned from training data. [[File:cnnbigpicture.jpeg|thumb|upright=2|center|alt=text|Figure: Overview of Convolutional Neural Network]] There are several differences between convolutional neural networks and ordinary neural networks; the figure above gives a visual representation of a convolutional neural network. First, a CNN's neurons are organized topographically into banks laid out on a 2D grid, reflecting the organization of the dimensions of the input data. Second, neurons in a CNN apply filters that are local and centered at the neuron's location in the topographic organization, since useful clues for identifying the object in an input image can be found by examining local neighborhoods of the image. Third, all neurons in a bank apply the same filter at different locations in the input image. In the image example, green is the input to one neuron bank, yellow is the filter bank, and pink is the output of one neuron bank (the convolved feature). A bank of neurons in a CNN applies a convolution operation, i.e. a filter, to its input, and a single layer in a CNN typically has multiple banks of neurons, each performing a convolution with a different filter. The resulting neuron banks become distinct input channels into the next layer. This weight sharing reduces the net's representational capacity, but also reduces the capacity to overfit.<br />
[[File:bankofneurons.gif|thumb|upright=3|center|alt=text|Figure: Bank of neurons]]<br />
<br />
=== Pooling ===<br />
<br />
The pooling layer summarizes the activities of local patches of neurons in the convolutional layer by subsampling its output. Pooling is useful for extracting dominant features and decreases the computational power required to process the data through dimensionality reduction. The output from the convolutional layer is divided into sections called pooling units, which are laid out topographically and connected to a local neighborhood of other pooling units from the same convolutional output. Each pooling unit then computes some function of its section, typically the maximum or the average: max pooling returns the maximum value from the section of the image covered by the pooling unit, while average pooling returns the average of all the values inside it (see example). As a result, there are fewer total pooling units than convolutional unit outputs from the previous layer, due to the larger spacing between pixels on pooling layers. Using the max-pooling function reduces the effect of outliers and improves generalization. In overlapping pooling, the spacing between pooling units (usually referred to as the stride) is smaller than the size of the neighborhood that each pooling unit summarizes; with this variant, the pooling layer produces a coarse coding of the outputs, which helps generalization. <br />
[[File:maxandavgpooling.jpeg|thumb|upright=2|center|alt=text|Figure: Max pooling and Average pooling]]<br />
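The two pooling functions described above can be illustrated with a small sketch (hypothetical 4×4 input; each pooling unit here summarizes a 2×2 neighborhood with stride 2, and choosing a stride smaller than the neighborhood size gives the overlapping variant):<br />

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Summarize local patches of the input with max or average pooling."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

x = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(pool2d(x, mode="max"))   # rows: [6, 8] and [3, 4]
print(pool2d(x, mode="avg"))   # rows: [3.75, 5.25] and [2, 2]
```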
<br />
=== Local Response Normalization === <br />
<br />
This network includes local response normalization layers, which are implemented in lateral form, used on neurons with unbounded activations, and permit the detection of high-frequency features with a large neuron response. This regularizer encourages competition among neurons belonging to different banks. Normalization is done by dividing the activity of a neuron in bank <math>i</math> at position <math>(x,y)</math> by the equation:<br />
[[File:local response norm.png|upright=2|center|]] where the sum runs over <math>N</math> ‘adjacent’ banks of neurons at the same position in the topographic organization. The constants <math>N</math>, <math>\alpha</math>, and <math>\beta</math> are hyper-parameters whose values are determined using a validation set. This technique has since been replaced by better techniques such as the combination of dropout and weight regularization (<math>L_1</math> and <math>L_2</math>).<br />
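The formula itself appears only as an image above; as an illustration, the sketch below assumes the standard normalization form from Krizhevsky et al., where each activity is divided by <math>(k + \alpha \sum a^2)^\beta</math> summed over <math>N</math> adjacent banks. The constants here are placeholders, not the validated values:<br />

```python
import numpy as np

def local_response_norm(a, N=5, k=2.0, alpha=1e-4, beta=0.75):
    """a: activities with shape (banks, H, W). Each activity is divided by
    (k + alpha * sum of squared activities over N adjacent banks) ** beta."""
    banks = a.shape[0]
    out = np.empty_like(a)
    half = N // 2
    for i in range(banks):
        lo, hi = max(0, i - half), min(banks, i + half + 1)  # clip at the edges
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        out[i] = a[i] / denom
    return out

a = np.ones((8, 4, 4))          # 8 banks of 4x4 activities, all equal to 1
b = local_response_norm(a)
print(b.shape)                  # (8, 4, 4); every activity is scaled down
```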
<br />
=== Neuron nonlinearities ===<br />
<br />
All of the neurons in this model use the max-with-zero nonlinearity, where a neuron's output is computed as <math> a^{i}_{x,y} = \max(0, z^i_{x,y})</math>, with <math> z^i_{x,y} </math> the total input to the neuron. This nonlinearity has several advantages over traditional saturating neuron models, such as a significant reduction in the training time required to reach a given error rate. Another advantage is that it reduces the need for contrast normalization and data pre-processing, since the neurons do not saturate: activities simply scale up gradually, even with large input values. As the model's only pre-processing step, they subtract the mean activity from each pixel, yielding centered data.<br />
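A quick sketch of the max-with-zero nonlinearity (now commonly called the rectified linear unit, ReLU):<br />

```python
import numpy as np

def max_with_zero(z):
    """Max-with-zero nonlinearity: a = max(0, z)."""
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(max_with_zero(z))  # negative inputs clamp to 0; positive inputs pass through
# Non-saturating: for positive inputs the activity simply scales with the
# input, unlike sigmoid/tanh units whose output flattens out.
```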
<br />
=== Objective function ===<br />
<br />
The objective function of their network maximizes the multinomial logistic regression objective, which is equivalent to minimizing the average cross-entropy between the true labels and the model's predicted labels across training cases.<br />
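This equivalence can be made concrete with a tiny sketch (made-up logits; maximizing the log probability of the true class is the same as minimizing this cross-entropy loss):<br />

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, true_label):
    # negative log-probability assigned to the true class
    return -np.log(probs[true_label])

logits = np.array([2.0, 0.5, -1.0])   # hypothetical network outputs
p = softmax(logits)
# Maximizing the multinomial logistic objective = minimizing this loss,
# averaged over training cases.
loss = cross_entropy(p, true_label=0)
print(round(loss, 4))  # 0.2413
```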
<br />
=== Weight Initialization === <br />
<br />
It’s important to note that if a neuron always receives a negative total input during training, it will not learn, because its output is uniformly zero under the max-with-zero nonlinearity. Hence, the weights in their model were sampled from a zero-mean normal distribution with a high enough variance: high variance in the weights leaves a sufficient number of neurons with positive inputs so that learning can happen. In practice, it’s necessary to try out several candidate variances until a working initialization is found. In their experiments, initializing the biases of the hidden-layer neurons to a positive constant (such as 1) helped in finding one.<br />
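The initialization recipe can be sketched as follows (the layer sizes and candidate standard deviation are illustrative, not the paper's tuned values). The point is that with a positive bias, most units receive positive total input and can therefore learn:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 784, 800          # hypothetical layer sizes
sigma = 0.01                       # one candidate standard deviation; in
                                   # practice several values are tried

W = rng.normal(0.0, sigma, size=(n_in, n_hidden))
b = np.ones(n_hidden)              # positive constant bias keeps units active

x = rng.standard_normal(n_in)      # a toy centered input
z = x @ W + b                      # total input to each hidden unit
active = np.mean(z > 0)            # fraction of units that would fire
print(active)                      # nearly all units are active, so they can learn
```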
<br />
=== Training ===<br />
<br />
The model is trained using stochastic gradient descent with a batch size of 128 samples and a momentum of 0.9. The update rule for weight <math>w</math> is $$ v_{i+1} = 0.9v_i + \epsilon \left\langle \frac{dE}{dw} \right\rangle_i $$ $$ w_{i+1} = w_i + v_{i+1} $$ where <math>i</math> is the iteration index, <math>v</math> is a momentum variable, <math>\epsilon</math> is the learning rate, and <math>\left\langle \frac{dE}{dw} \right\rangle_i</math> is the average over the <math>i</math>th batch of the derivative of the objective with respect to <math>w</math>. The whole training process takes roughly 90 minutes on CIFAR-10; ImageNet takes four days with dropout and two days without.<br />
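The update rule above can be sketched directly (the gradient values are made up; the gradient is added, not subtracted, because the objective is maximized):<br />

```python
import numpy as np

def sgd_momentum_step(w, v, grad_avg, eps=0.001, momentum=0.9):
    """One update of v <- 0.9 v + eps * <dE/dw>, w <- w + v."""
    v = momentum * v + eps * grad_avg
    w = w + v
    return w, v

w = np.zeros(3)
v = np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])   # made-up averaged batch gradient
for _ in range(2):
    w, v = sgd_momentum_step(w, v, grad)
print(w)  # velocity accumulates: the second step moves farther than the first
```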
<br />
=== Learning ===<br />
To determine the learning rate for the network, they start with an equal learning rate for each layer, chosen as the power of ten that produces the largest reduction in the objective function; usually this is on the order of <math>10^{-2}</math> or <math>10^{-3}</math>. The learning rate is then reduced twice by a factor of ten before training terminates.<br />
<br />
= CIFAR-10 =<br />
<br />
=== CIFAR-10 Dataset ===<br />
<br />
The CIFAR-10 dataset is a subset of the Tiny Images dataset with 10 classes, obtained after removing incorrectly labeled images. It contains 5000 training images and 1000 testing images for each class. The images are 32 x 32 color images gathered from the web, each labeled with the noun used to search for it.<br />
<br />
[[File:CIFAR-10.png|thumb|upright=2|center|alt=text|Figure 4: CIFAR-10 Sample Dataset]]<br />
<br />
=== Models for CIFAR-10 ===<br />
<br />
Two models, one with dropout and one without, were built to test the performance of dropout on CIFAR-10. Both are CNNs with three convolutional layers, each followed by a pooling layer. All of the pooling layers use a stride of 2 and summarize a 3×3 neighborhood. The pooling layer following the first convolutional layer performs max-pooling, while the remaining two perform average-pooling. The first and second pooling layers are followed by response normalization layers with <math>N = 9, α = 0.001</math>, and <math>β = 0.75</math>. A ten-unit softmax layer, which outputs a probability distribution over class labels, is connected to the upper-most pooling layer. All convolutional layers have 64 filter banks with a filter size of 5×5.<br />
<br />
Additional changes were made to the model with dropout. Dropout imposes strong regularization on the network, which allows more parameters to be used. Thus, a fourth weight layer was added, taking its input from the previous pooling layer. This fourth weight layer is locally connected but not convolutional, and contains 16 banks of filters of size 3 × 3, with 50% dropout. Lastly, the softmax layer takes its input from this fourth weight layer.<br />
<br />
With a neural network of three convolutional hidden layers and three max-pooling layers, the classification error reached 16.6%, beating the best previously published error rate of 18.5% among models not using transformed data. The model with one additional locally-connected layer and dropout at the last hidden layer reduced the error rate to 15.6%.<br />
<br />
= ImageNet =<br />
<br />
===ImageNet Dataset===<br />
<br />
ImageNet is a dataset of millions of labeled high-resolution images belonging to 1000 different categories. The data were collected from the web and labeled manually using Amazon's Mechanical Turk crowd-sourcing tool.<br />
Because this dataset has millions of labeled images in thousands of categories, it is very difficult to achieve perfect accuracy, even for humans, since the ImageNet images may contain multiple objects and there is a large number of object classes. ImageNet and CIFAR-10 are very similar, but the scale of ImageNet is about 20 times bigger (1,300,000 vs 60,000 images). ImageNet consists of about 1.3 million training images, 50,000 validation images, and 150,000 testing images; they used images resized to 256 x 256 pixels for their experiments.<br />
<br />
'''An ambiguous example to classify:'''<br />
<br />
[[File:imagenet1.png|200px|center]]<br />
<br />
When this paper was written, the best published result on this dataset was an error rate of 45.7%, from "High-dimensional signature compression for large-scale image classification" (J. Sanchez, F. Perronnin, CVPR 2011). The authors of this paper achieved a comparable error rate of 48.6% using a single neural network with five convolutional hidden layers interleaved with max-pooling layers, followed by two globally connected layers and a final 1000-way softmax layer. Applying 50% dropout to the sixth layer brought the error rate down to 42.4%.<br />
<br />
'''ImageNet Dataset:'''<br />
<br />
[[File:imagenet2.png|400px|center]]<br />
<br />
===Models for ImageNet===<br />
<br />
They focused mostly on the model with dropout, since the one without dropout followed a similar approach but had a serious overfitting issue. They trained a convolutional neural network on 224×224 patches randomly extracted from the 256×256 images; this form of data augmentation reduces the network's capacity to overfit the training data and helps generalization. For testing, the net's predictions were averaged over ten 224×224 patches of the 256×256 input image: the center patch, the four corner patches, and their horizontal reflections. This complicated network architecture was used to maximize performance on the validation set, and dropout was found to be very effective. It was also demonstrated that non-convolutional higher layers with a large number of parameters worked well with dropout, but had a negative impact on performance without dropout.<br />
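The ten-patch test-time procedure can be sketched as follows (a blank 256×256 array stands in for a real image):<br />

```python
import numpy as np

def ten_crops(image, crop=224):
    """Extract the ten test-time patches: center, four corners,
    and the horizontal reflection of each."""
    H, W = image.shape[:2]
    dy, dx = H - crop, W - crop
    offsets = [(dy // 2, dx // 2),                   # center
               (0, 0), (0, dx), (dy, 0), (dy, dx)]  # four corners
    patches = [image[y:y+crop, x:x+crop] for y, x in offsets]
    patches += [p[:, ::-1] for p in patches]         # horizontal reflections
    return patches

image = np.zeros((256, 256))
patches = ten_crops(image)
print(len(patches), patches[0].shape)  # 10 (224, 224)
# The net's predicted class probabilities are averaged over these ten patches.
```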
<br />
The network contains seven weight layers: the first five are convolutional and the last two are globally connected, with max-pooling layers following layers 1, 2, and 5. The output of the last globally connected layer is fed to a 1000-way softmax output layer. Using this architecture, the authors achieved an error rate of 48.6%; applying 50% dropout to the sixth layer brought the error rate down to 42.4%.<br />
<br />
<br />
[[File:modelh2.png|700px|center]] <br />
<br />
[[File:layer2.png|600px|center]]<br />
<br />
As with the previous datasets (MNIST, TIMIT, Reuters, and CIFAR-10), we see a significant improvement on the ImageNet dataset. Even for complicated architectures like this one, introducing dropout improves generalization and gives lower test error rates.<br />
<br />
= Conclusion =<br />
<br />
The authors have shown that models trained with dropout achieve a consistent improvement in classifying objects in the following datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.<br />
<br />
= Critiques =<br />
It is a very brilliant idea to drop out half of the neurons to reduce co-adaptations. It is mentioned that for fully connected layers, dropout in all hidden layers works better than dropout in only one hidden layer. Another paper, Dropout: A Simple Way to Prevent Neural Networks from Overfitting [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf], gives a more detailed explanation.<br />
<br />
It will be interesting to see how this paper could be used to prevent overfitting of LSTMs.<br />
<br />
Firstly, classification with the "dropout" CNN method (omitting neurons in hidden layers) is a very interesting topic. If the authors briefly explained the theoretical advantages of this method for processing image data, it would be easier for readers to understand. A discussion of how to deal with the overfitting issue would also be valuable.<br />
<br />
The authors mention that they tried various dropout probabilities and that the majority of them improved the model's generalization performance, but more extreme probabilities tended to be worse, which is why a dropout rate of 50% was used in the paper. The authors further note that the method could be improved by adapting individual dropout probabilities for each hidden or input unit using validation tests. This would be an interesting area to develop and explore further, as a hardcoded 50% dropout for all layers might not be the optimal choice for all CNN applications. It would have been interesting to see the results of their investigations of differing dropout rates.<br />
<br />
The authors don't explain that during training, the retained values at each layer where dropout is applied should be scaled by <math>1/p</math>, where <math>p</math> is the probability of keeping a unit; this way the expected value of each layer is the same at train and test time. They may have considered another solution for this discrepancy at the time (it is an old paper), but it doesn't seem like any such solution is presented here.<br />
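The train-time scaling this critique refers to, commonly called inverted dropout, can be sketched as follows (a vector of ones makes the preserved expectation easy to see):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, p_keep=0.5):
    """Inverted dropout: scale the retained activations by 1/p_keep during
    training, so their expected value matches the unscaled activations used
    at test time (no weight halving needed)."""
    mask = rng.random(x.shape) < p_keep
    return x * mask / p_keep

x = np.ones(100000)
y = dropout_train(x)
print(round(float(y.mean()), 2))  # close to 1.0: the expectation is preserved
```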
<br />
Despite the advantages of using dropout to prevent overfitting and reducing errors in testing, the authors did not discuss much about the effects on the length of training time. In another [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf paper] published a few years later by the same authors, there was more discussion about this. It appears that dropout increases training time by 2-3 times compared to a standard NN with the same architecture, which is a drawback that might be worth mentioning. <br />
<br />
== Reference ==<br />
[1] N. Srivastava, "Dropout: a simple way to prevent neural networks from overfitting", The Journal of Machine Learning Research, Jan 2014.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectors&diff=49043Improving neural networks by preventing co-adaption of feature detectors2020-12-03T23:02:17Z<p>Z42qin: </p>
<hr />
<div>== Presented by ==<br />
Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim<br />
<br />
= Improvement Intro =<br />
'''Dropout Model'''<br />
In this paper, Hinton et al. introduce a novel way to improve neural network performance, particularly when a large feedforward neural network is trained on a small training set, which causes poor performance due to “overfitting”. This problem can be reduced by randomly omitting half of the feature detectors on each training case. By omitting neurons in hidden layers with a probability of 0.5, each hidden unit is prevented from relying on other hidden units being present during training, so there are fewer co-adaptations among them on the training data. Called “dropout,” this process is also an efficient alternative to training many separate networks and averaging their predictions on the test set.<br />
<br />
The intuition for dropout is that if neurons are randomly dropped during training, they can no longer rely on their neighbours, forcing each neuron to become more robust. Another interpretation is that dropout is similar to training an ensemble of models, since each epoch with randomly dropped neurons can be viewed as its own model.<br />
<br />
They used the standard stochastic gradient descent algorithm with the training data separated into mini-batches. An upper bound was set on the L2 norm of the incoming weight vector of each hidden neuron; the vector was renormalized whenever its length exceeded the bound. They found that using a constraint, instead of a penalty, forced the model to do a more thorough search of the weight-space when coupled with the very large learning rate that decays during training.<br />
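The max-norm constraint described above can be sketched as follows (the bound <math>c = 15</math> matches the value reported for MNIST below; the weight matrix is made up):<br />

```python
import numpy as np

def apply_max_norm(W, c=15.0):
    """Constrain the L2 norm of each hidden unit's incoming weight vector
    (a column of W) to at most c, renormalizing only when it exceeds the bound."""
    norms = np.linalg.norm(W, axis=0)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 30.0],
              [4.0, 40.0]])          # column norms: 5 and 50
W = apply_max_norm(W, c=15.0)
print(np.linalg.norm(W, axis=0))     # only the violating column shrinks, to norm 15
```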
<br />
'''Mean Network'''<br />
<br />
At test time, they use a network that includes all of the hidden neurons but halves their outgoing weights to account for the chance of omission. This is called a 'mean network'. It is similar to taking the geometric mean of the probability distributions predicted by all <math>2^N</math> possible dropout networks. The mean network assigns a higher log probability to the correct answers than an individual dropout network does, which also leads to a lower squared error.<br />
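For a linear output unit, halving the outgoing weights exactly matches the average over all dropout networks; the sketch below checks this with Monte Carlo sampling (toy activities and weights; for softmax outputs the mean network instead corresponds to a geometric mean, as noted above):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 50
h = rng.random(n_hidden)             # toy hidden activities
w = rng.standard_normal(n_hidden)    # outgoing weights of the hidden layer

# Monte Carlo average over many random dropout networks (each hidden unit
# kept with probability 0.5)...
samples = [(h * (rng.random(n_hidden) < 0.5)) @ w for _ in range(20000)]
mc_mean = np.mean(samples)

# ...matches the single "mean network" that keeps all units but halves weights.
mean_net = h @ (w * 0.5)
print(round(float(mc_mean), 2), round(float(mean_net), 2))
```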
<br />
<br />
The models were shown to result in lower test error rates on several datasets: MNIST; TIMIT; Reuters Corpus Volume; CIFAR-10; and ImageNet.<br />
<br />
= MNIST =<br />
The MNIST dataset contains 70,000 digit images of size 28 x 28. To see the impact of dropout, they used four different architectures (784-800-800-10, 784-1200-1200-10, 784-2000-2000-10, 784-1200-1200-1200-10), with dropout rates of 50% for hidden neurons and 20% for visible neurons. Stochastic gradient descent was used with mini-batches of size 100 and a cross-entropy loss function. Weights were updated after each mini-batch, and training ran for 3000 epochs. An exponentially decaying learning rate <math>\epsilon</math> was used, with an initial value of 10.0 multiplied by the decay factor <math>f = 0.998</math> at the end of each epoch. At each hidden layer, the incoming weight vector of each hidden neuron was constrained to an upper bound <math>l</math> on its length; cross-validation showed that results were best when <math>l = 15</math>. Initial weight values were drawn from a normal distribution with mean 0 and standard deviation 0.01. To accelerate learning, the weight updates used a momentum variable <math>p</math>, whose initial value of 0.5 increased linearly to a final value of 0.99 over the first 500 epochs and remained unchanged thereafter. When updating the weights, the learning rate was also multiplied by <math>1 - p</math>. In the update rule shown below, <math>L</math> denotes the loss function.<br />
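The learning-rate decay and momentum schedule described above can be sketched as follows (values taken from the text; the loop only tracks the schedule, not actual training):<br />

```python
# Learning-rate decay and momentum schedule for the MNIST experiments.
epochs = 3000
eps = 10.0          # initial learning rate
f = 0.998           # decay factor applied at the end of each epoch
p0, p_final, ramp = 0.5, 0.99, 500   # momentum ramps linearly over 500 epochs

lrs, momenta = [], []
for epoch in range(epochs):
    p = p0 + (p_final - p0) * min(epoch, ramp) / ramp  # linear ramp, then flat
    momenta.append(p)
    lrs.append(eps)
    eps *= f        # exponential decay; updates also scale by (1 - p)

print(round(lrs[-1], 4), momenta[-1])  # 0.0247 0.99
```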
<br />
[[File:weights_mnist2.png|center|400px]]<br />
<br />
The best published result for a standard feedforward neural network was 160 errors. This was reduced to about 130 errors with 0.5 dropout and different L2 constraints for each hidden unit input weight. By omitting a random 20% of the input pixels in addition to the aforementioned changes, the number of errors was further reduced to 110. The following figure visualizes the result.<br />
[[File:mnist_figure.png|center|500px]]<br />
A publicly available pre-trained deep belief net resulted in 118 errors, which was reduced to 92 errors when the model was fine-tuned with dropout. Another publicly available model, a deep Boltzmann machine, resulted in 103, 97, 94, 93, and 88 errors when fine-tuned using standard backpropagation after being unrolled; these were reduced to 83, 79, 78, 78, and 77 when fine-tuned with dropout. The mean of 79 errors was a record for models that do not use prior knowledge or enhanced training sets.<br />
<br />
= TIMIT = <br />
<br />
The TIMIT dataset includes voice samples of 630 American English speakers across 8 different dialects. It is often used to evaluate the performance of automatic speech recognition systems. Using Kaldi, the dataset was pre-processed to extract input features in the form of log filter bank responses.<br />
<br />
=== Pre-training and Training ===<br />
<br />
For pretraining, they used a deep belief network whose first layer was built using a Restricted Boltzmann Machine (RBM) with Gaussian visible units. Visible biases were initialized to zero, and weights were sampled from a normal distribution <math>N(0, 0.01)</math>. Each visible neuron’s variance was set to 1.0 and remained unchanged.<br />
<br />
Learning was carried out by minimizing Contrastive Divergence (CD). Momentum, used to speed up learning, was initially set to 0.5 and increased linearly to 0.9 over 20 epochs. A learning rate of 0.001 was applied to the average gradient and multiplied by <math>(1-\text{momentum})</math>, and the L2 weight decay was set to 0.001. With these hyperparameters, training finished after 100 epochs. All subsequent layers were trained as binary RBMs with a learning rate of 0.01. The visible bias of each neuron was initialized to <math>\log(p/(1 - p))</math>, where <math>p</math> is the mean activation of that neuron in the dataset. Each layer was trained for 50 epochs, with all remaining hyper-parameters the same as those for the Gaussian RBM.<br />
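CD learning can be sketched minimally for a binary RBM (the first layer in the paper actually uses Gaussian visible units, and momentum and weight decay are omitted here; the sizes and data are made up):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, bv, bh, lr=0.01):
    """One step of contrastive divergence (CD-1) for a binary RBM.
    Updates W, bv, bh in place; returns the reconstruction error."""
    ph0 = sigmoid(v0 @ W + bh)                # up: hidden probabilities given data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden states
    pv1 = sigmoid(h0 @ W.T + bv)              # down: one-step reconstruction
    ph1 = sigmoid(pv1 @ W + bh)               # up again from the reconstruction
    # approximate gradient of the log-likelihood: data term minus model term
    W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    bv += lr * (v0 - pv1)
    bh += lr * (ph0 - ph1)
    return np.mean((v0 - pv1) ** 2)

n_vis, n_hid = 6, 4
W = rng.normal(0, 0.01, (n_vis, n_hid))
bv, bh = np.zeros(n_vis), np.zeros(n_hid)
v = np.array([1., 0., 1., 0., 1., 0.])        # a single toy training vector
errors = [cd1_update(v, W, bv, bh) for _ in range(200)]
print(errors[0] > errors[-1])   # reconstruction error decreases with training
```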
<br />
=== Dropout tuning ===<br />
<br />
The initial weights of the neural network were set from the pretrained RBMs. To fine-tune the network with dropout-backpropagation, momentum was initially set to 0.5 and increased linearly to 0.9 over 10 epochs. A constant learning rate of 1.0 was applied to the average gradient on each minibatch, with all other hyperparameters the same as for MNIST dropout fine-tuning. The model required approximately 200 epochs to converge. For comparison purposes, they also fine-tuned the same network with standard backpropagation, using a learning rate of 0.1 and the same hyperparameters.<br />
<br />
=== Classification Test and Performance ===<br />
<br />
A neural network was constructed to measure the classification error rate on the TIMIT test set. It has four fully-connected hidden layers with 4000 neurons per layer and an output layer of 185 softmax neurons, which are merged into 39 distinct classes. The network's input is 21 adjacent frames with an advance of 10 ms per frame.<br />
<br />
Comparing the performance of dropout with standard backpropagation on several network architectures and input representations, dropout consistently achieved lower error and cross-entropy. Results showed that it significantly controls overfitting, making the method robust to choices of network architecture. It also allowed much larger nets to be trained and removed the need for early stopping. Thus, neural network architectures with dropout are not very sensitive to the choice of learning rate and momentum.<br />
<br />
= Reuters Corpus Volume =<br />
Reuters Corpus Volume I is an archive of 804,414 news documents labeled with 103 topics. Grouped under four major themes – corporate/industrial, economics, government/social, and markets – the documents fall into 63 classes. After removing 11 classes with no data and one class with insufficient data, 50 classes and 402,738 documents remain. The documents were divided randomly and equally into training and test sets, and each document was represented using the 2000 most frequent words in the dataset, excluding stopwords.<br />
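The document representation described above can be sketched as follows (a toy corpus stands in for the 402,738 documents, and real preprocessing would use a full stopword list):<br />

```python
from collections import Counter

# Toy corpus; the actual dataset keeps the 2000 most frequent non-stopwords.
docs = ["the market rose sharply", "the government cut rates",
        "market rates fell", "corporate profits rose"]
stopwords = {"the"}

counts = Counter(w for d in docs for w in d.split() if w not in stopwords)
vocab = [w for w, _ in counts.most_common(2000)]   # most frequent words first
index = {w: i for i, w in enumerate(vocab)}

def to_vector(doc):
    """Represent a document by which vocabulary words it contains."""
    vec = [0] * len(vocab)
    for w in doc.split():
        if w in index:
            vec[index[w]] = 1
    return vec

v = to_vector("market rates rose")
print(sum(v))  # 3: the three vocabulary words present in the document
```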
<br />
They trained two neural networks of size 2000-2000-1000-50, one using dropout and backpropagation and the other using standard backpropagation. The training hyperparameters are the same as those for MNIST, but training ran for 500 epochs.<br />
<br />
The following figure shows the significant improvement in test-set error from the model with dropout. The right panel shows that learning with dropout also proceeds more smoothly.<br />
<br />
[[File:reuters_figure.png|700px|center]]<br />
<br />
= CNN =<br />
<br />
Feed-forward neural networks consist of several layers of neurons where each neuron in a layer applies a linear filter to the input image data and is passed on to the neurons in the next layer. When calculating the neuron’s output, scalar bias a.k.a weights is applied to the filter with nonlinear activation function as parameters of the network that are learned by training data. [[File:cnnbigpicture.jpeg|thumb|upright=2|center|alt=text|Figure: Overview of Convolutional Neural Network]] There are several differences between Convolutional Neural networks and ordinary neural networks. The figure above gives a visual representation of a Convolutional Neural Network. First, CNN’s neurons are organized topographically into a bank and laid out on a 2D grid, so it reflects the organization of dimensions of the input data. Secondly, neurons in CNN apply filters which are local, and which are centered at the neuron’s location in the topographic organization. Meaning that useful metrics or clues to identify the object in an input image which can be found by examining local neighborhoods of the image. Next, all neurons in a bank apply the same filter at different locations in the input image. When looking at the image example, green is an input to one neuron bank, yellow is filter bank, and pink is the output of one neuron bank (convolved feature). A bank of neurons in a CNN applies a convolution operation, aka filters, to its input where a single layer in a CNN typically has multiple banks of neurons, each performing a convolution with a different filter. The resulting neuron banks become distinct input channels into the next layer. The whole process reduces the net’s representational capacity, but also reduces the capacity to overfit.<br />
[[File:bankofneurons.gif|thumb|upright=3|center|alt=text|Figure: Bank of neurons]]<br />
<br />
=== Pooling ===<br />
<br />
Pooling layer summarizes the activities of local patches of neurons in the convolutional layer by subsampling the output of a convolutional layer. Pooling is useful for extracting dominant features, to decrease the computational power required to process the data through dimensionality reduction. The procedure of pooling goes on like this; output from convolutional layers is divided into sections called pooling units and they are laid out topographically, connected to a local neighborhood of other pooling units from the same convolutional output. Then, each pooling unit is computed with some function which could be maximum and average. Maximum pooling returns the maximum value from the section of the image covered by the pooling unit while average pooling returns the average of all the values inside the pooling unit (see example). In result, there are fewer total pooling units than convolutional unit outputs from the previous layer, this is due to larger spacing between pixels on pooling layers. Using the max-pooling function reduces the effect of outliers and improves generalization. Other than that, overlapping pooling makes this spacing between pixels smaller than the size of the neighborhood that the pooling units summarize (This spacing is usually referred as the stride between pooling units). With this variant, pooling layer can produce a coarse coding of the outputs which helps generalization. <br />
[[File:maxandavgpooling.jpeg|thumb|upright=2|center|alt=text|Figure: Max pooling and Average pooling]]<br />
<br />
=== Local Response Normalization === <br />
<br />
This network includes local response normalization layers which are implemented in lateral form and used on neurons with unbounded activations and permits the detection of high-frequency features with a big neuron response. This regularizer encourages competition among neurons belonging to different banks. Normalization is done by dividing the activity of a neuron in bank <math>i</math> at position <math>(x,y)</math> by the equation:<br />
[[File:local response norm.png|upright=2|center|]] where the sum runs over <math>N</math> ‘adjacent’ banks of neurons at the same position as in the topographic organization of neuron bank. The constants, <math>N</math>, <math>alpha</math> and <math>betas</math> are hyper-parameters whose values are determined using a validation set. This technique is replaced by better techniques such as the combination of dropout and regularization methods (<math>L1</math> and <math>L2</math>)<br />
<br />
=== Neuron nonlinearities ===<br />
<br />
All of the neurons for this model use the max-with-zero nonlinearity where output within a neuron is computed as <math> a^{i}_{x,y} = max(0, z^i_{x,y})</math> where <math> z^i_{x,y} </math> is the total input to the neuron. The reason they use nonlinearity is because it has several advantages over traditional saturating neuron models, such as significant reduction in training time required to reach a certain error rate. Another advantage is that nonlinearity reduces the need for contrast-normalization and data pre-processing since neurons do not saturate- meaning activities simply scale up little by little with usually large input values. For this model’s only pre-processing step, they subtract the mean activity from each pixel and the result is a centered data.<br />
<br />
=== Objective function ===<br />
<br />
The objective function of their network maximizes the multinomial logistic regression objective which is the same as minimizing the average cross-entropy across training cases between the true label and the model’s predicted label.<br />
<br />
=== Weight Initialization === <br />
<br />
It’s important to note that if a neuron always receives a negative value during training, it will not learn because its output is uniformly zero under the max-with-zero nonlinearity. Hence, the weights in their model were sampled from a zero-mean normal distribution with a high enough variance. High variance in weights will set a certain number of neurons with positive values for learning to happen, and in practice, it’s necessary to try out several candidates for variances until a working initialization is found. In their experiment, setting a positive constant, or 1, as biases of the neurons in the hidden layers was helpful in finding it.<br />
<br />
=== Training ===<br />
<br />
In this model, a batch size of 128 samples and momentum of 0.9, we train our model using stochastic gradient descent. The update rule for weight <math>w</math> is $$ v_{i+1} = 0.9v_i + \epsilon <\frac{dE}{dw_i}> i$$ $$w_{i+1} = w_i + v_{i+1} $$ where <math>i</math> is the iteration index, <math>v</math> is a momentum variable, <math>\epsilon</math> is the learning rate and <math>\frac{dE}{dw}</math> is the average over the <math>i</math>th batch of the derivative of the objective with respect to <math>w_i</math>. The whole training process on CIFAR-10 takes roughly 90 minutes and ImageNet takes 4 days with dropout and two days without.<br />
<br />
=== Learning ===<br />
To determine the learning rate for the network, it is a must to start with an equal learning rate for each layer which produces the largest reduction in the objective function with power of ten. Usually, it is in the order of <math>10^{-2}</math> or <math>10^{-3}</math>. In this case, they reduce the learning rate twice by a factor of ten before termination of training.<br />
<br />
= CIFAR-10 =<br />
<br />
=== CIFAR-10 Dataset ===<br />
<br />
Removing incorrect labels, The CIFAR-10 dataset is a subset of the Tiny Images dataset with 10 classes. It contains 5000 training images and 1000 testing images for each class. The dataset has 32 x 32 color images searched from the web and the images are labeled with the noun used to search the image.<br />
<br />
[[File:CIFAR-10.png|thumb|upright=2|center|alt=text|Figure 4: CIFAR-10 Sample Dataset]]<br />
<br />
=== Models for CIFAR-10 ===<br />
<br />
Two models, one with dropout and one without dropout, were built to test the performance of dropout on CIFAR-10. All models have CNN with three convolutional layers each with a pooling layer. All of the pooling payers use a stride=2 and summarize a 3*3 neighborhood. The max-pooling method is performed by the pooling layer which follows the first convolutional layer, and the average-pooling method is performed by remaining 2 pooling layers. The first and second pooling layers with <math>N = 9, α = 0.001</math>, and <math>β = 0.75</math> are followed by response normalization layers. A ten-unit softmax layer, which is used to output a probability distribution over class labels, is connected with the upper-most pooling layer. Using filter size of 5×5, all convolutional layers have 64 filter banks.<br />
<br />
Additional changes were made with the model with dropout. The model with dropout enables us to use more parameters because dropout forces a strong regularization on the network. Thus, a fourth weight layer is added to take the input from the previous pooling layer. This fourth weight layer is locally connected, but not convolutional, and contains 16 banks of filters of size 3 × 3 with 50% dropout. Lastly, the softmax layer takes its input from this fourth weight layer.<br />
<br />
Thus, with a neural network with 3 convolutional hidden layers with 3 max-pooling layers, the classification error achieved 16.6% to beat 18.5% from the best published error rate without using transformed data. The model with one additional locally-connected layer and dropout at the last hidden layer produced the error rate of 15.6%.<br />
<br />
= ImageNet =<br />
<br />
===ImageNet Dataset===<br />
<br />
ImageNet is a dataset of millions of high-resolution images, and they are labeled among 1000 different categories. The data were collected from the web and manually labeled using MTerk tool, which is a crowd-sourcing tool provided by Amazon.<br />
Because this dataset has millions of labeled images in thousands of categories, it is very difficult to have perfect accuracy on this dataset even for humans because the ImageNet images may contain multiple objects and there are a large number of object classes. ImageNet and CIFAR-10 are very similar, but the scale of ImageNet is about 20 times bigger (1,300,000 vs 60,000). The size of ImageNet is about 1.3 million training images, 50,000 validation images, and 150,000 testing images. They used resized images of 256 x 256 pixels for their experiments.<br />
<br />
'''An ambiguous example to classify:'''<br />
<br />
[[File:imagenet1.png|200px|center]]<br />
<br />
When this paper was written, the best published result on this dataset was an error rate of 45.7%, from "High-dimensional signature compression for large-scale image classification" (J. Sanchez, F. Perronnin, CVPR 2011). The authors of this paper achieved a comparable error rate of 48.6% using a single neural network with five convolutional hidden layers, interleaved with max-pooling layers, followed by two globally connected layers and a final 1000-way softmax layer. Applying 50% dropout to the 6th layer brought the error rate down to 42.4%.<br />
<br />
'''ImageNet Dataset:'''<br />
<br />
[[File:imagenet2.png|400px|center]]<br />
<br />
===Models for ImageNet===<br />
<br />
They mostly focused on the model with dropout; the one without dropout used a similar approach but had a serious overfitting issue. They trained a convolutional neural network on 224×224 patches randomly extracted from the 256×256 images; this data augmentation reduced the network’s capacity to overfit the training data and helped generalization. At test time, the network's predictions were averaged over ten 224×224 patches of each 256×256 input image: the center patch, the four corner patches, and their horizontal reflections. This complicated network architecture was used to maximize performance on the validation set, and dropout was found to be very effective. It was also demonstrated that non-convolutional higher layers with a large number of parameters worked well with dropout, but had a negative impact on performance without dropout.<br />
<br />
The network contains seven weight layers: the first five are convolutional, and the last two are globally connected. Max-pooling layers follow layers 1, 2, and 5. The output of the last globally connected layer is fed to a 1000-way softmax output layer. Using this architecture, the authors achieved an error rate of 48.6%; applying 50% dropout to the 6th layer brought the error rate down to 42.4%.<br />
<br />
<br />
[[File:modelh2.png|700px|center]] <br />
<br />
[[File:layer2.png|600px|center]]<br />
<br />
As with the previous datasets – MNIST, TIMIT, Reuters, and CIFAR-10 – we see a significant improvement on the ImageNet dataset. Even with complicated architectures like this one, introducing dropout makes models generalize better and gives lower test error rates.<br />
<br />
= Conclusion =<br />
<br />
The authors have shown a consistent improvement by the models trained with dropout in classifying objects in the following datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.<br />
<br />
= Critiques =<br />
It is a very brilliant idea to drop out half of the neurons to reduce co-adaptations. It is mentioned that for fully connected layers, dropout in all hidden layers works better than dropout in only one hidden layer. Another paper, Dropout: A Simple Way to Prevent Neural Networks from Overfitting [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf], gives a more detailed explanation.<br />
<br />
It will be interesting to see how this paper could be used to prevent overfitting of LSTMs.<br />
<br />
Firstly, classification with the "dropout" CNN method (omitting neurons in hidden layers) is a very interesting topic. If the authors could briefly explain the theoretical advantages of this method for processing image data, it would be easier for readers to understand. A discussion of how to deal with the overfitting issue would also be valuable.<br />
<br />
The authors mention that they tried various dropout probabilities and that the majority of them improved the model's generalization performance, but that more extreme probabilities tended to be worse which is why a dropout rate of 50% was used in the paper. The authors further develop this point to mention that the method can be improved by adapting individual dropout probabilities of each hidden or input unit using validation tests. This would be an interesting area to further develop and explore, as using a hardcoded 50% dropout for all layers might not be the optimal choice for all CNN applications. It would have been interesting to see the results of their investigations of differing dropout rates.<br />
<br />
The authors don't explain that during training, at each layer where dropout is applied, the values must be scaled by 1/(1-p), where p is the dropout rate – this way the expected value of the layer is the same at both train and test time. They may have considered another solution for this discrepancy at the time (it is an old paper), but it doesn't seem like any solution was presented here. <br />
<br />
Despite the advantages of using dropout to prevent overfitting and reduce test errors, the authors did not discuss its effect on training time much. In another [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf paper] published a few years later by the same authors, there is more discussion of this: it appears that dropout increases training time by 2-3 times compared to a standard NN with the same architecture, which is a drawback worth mentioning. <br />
<br />
== Reference ==<br />
[1] N. Srivastava, "Dropout: a simple way to prevent neural networks from overfitting", The Journal of Machine Learning Research, Jan 2014.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectors&diff=49042Improving neural networks by preventing co-adaption of feature detectors2020-12-03T23:01:36Z<p>Z42qin: /* Mean Network */</p>
<hr />
<div>== Presented by ==<br />
Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim<br />
<br />
= Drop Out Model =<br />
In this paper, Hinton et al. introduce a novel way to improve the performance of neural networks, particularly when a large feedforward network is trained on a small training set, which causes poor generalization and leads to overfitting. This problem can be reduced by randomly omitting half of the feature detectors on each training case. By omitting neurons in hidden layers with a probability of 0.5, each hidden unit is prevented from relying on other hidden units being present during training, so there are fewer co-adaptations among them on the training data. Called “dropout,” this process is also an efficient alternative to training many separate networks and averaging their predictions on the test set. <br />
<br />
The intuition for dropout is that if neurons are randomly dropped during training, they can no longer rely on their neighbours, thus allowing each neuron to become more robust. Another interpretation is that dropout is similar to training an ensemble of models, since each epoch with randomly dropped neurons can be viewed as its own model. <br />
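As a minimal sketch (not the authors' code), the training-time behaviour described above can be written as follows; at test time, scaling the activations by 0.5 is equivalent to halving the outgoing weights for the next linear layer:<br />

```python
import random

def dropout_forward(activations, drop_prob=0.5, train=True):
    """Sketch of dropout as described in the paper: omit units at train
    time; at test time keep all units and scale their activations."""
    if train:
        # Each hidden unit is omitted independently with probability drop_prob.
        return [0.0 if random.random() < drop_prob else a for a in activations]
    # Scaling by (1 - drop_prob) at test time matches halving the outgoing
    # weights when drop_prob = 0.5 (the "mean network" described below).
    return [a * (1.0 - drop_prob) for a in activations]
```

Here the test-time scaling is applied to the activations rather than the weights, which has the same effect on the next linear layer.<br />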
<br />
They used the standard stochastic gradient descent algorithm and separated the training data into mini-batches. An upper bound was set on the L2 norm of the incoming weight vector for each hidden neuron; the vector was renormalized whenever its norm exceeded the bound. They found that using a constraint, rather than a penalty, forced the model to do a more thorough search of the weight-space when coupled with a very large learning rate that decays during training.<br />
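A sketch of this max-norm constraint, assuming the bound <math>l = 15</math> that is later used for MNIST (illustrative only, not the authors' code):<br />

```python
import math

def constrain_l2(weights, bound=15.0):
    """Renormalize the incoming weight vector of a hidden unit if its
    L2 norm exceeds the bound (a hard constraint, not a penalty)."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm > bound:
        scale = bound / norm
        return [w * scale for w in weights]
    return weights
```
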
<br />
= Mean Network =<br />
<br />
Their dropout models included all of the hidden neurons at test time, with their outgoing weights halved to account for the probability of omission during training. This is called a 'mean network'. It is equivalent to taking the geometric mean of the probability distributions predicted by all <math>2^N</math> possible dropout networks. The mean network assigns the correct answers a higher log probability than an individual dropout network does, which also leads to a lower squared error. <br />
<br />
<br />
The models were shown to result in lower test error rates on several datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.<br />
<br />
= MNIST =<br />
The MNIST dataset contains 70,000 digit images of size 28 x 28. To see the impact of dropout, they used 4 different neural networks (784-800-800-10, 784-1200-1200-10, 784-2000-2000-10, 784-1200-1200-1200-10), using the same dropout rates: 50% for hidden neurons and 20% for visible neurons. Stochastic gradient descent was used with mini-batches of size 100 and a cross-entropy objective as the loss function. Weights were updated after each minibatch, and training was done for 3000 epochs. An exponentially decaying learning rate <math>\epsilon</math> was used, with the initial value set to 10.0; it was multiplied by the decay factor <math>f</math> = 0.998 at the end of each epoch. At each hidden layer, the incoming weight vector of each hidden neuron was constrained to have length at most <math>l</math>, and they found from cross-validation that the results were best when <math>l</math> = 15. Initial weight values were sampled from a normal distribution with mean 0 and standard deviation 0.01. To update weights, an additional variable <math>p</math>, called momentum, was used to accelerate learning. The initial value of <math>p</math> was 0.5, and it increased linearly to the final value 0.99 during the first 500 epochs, remaining unchanged afterwards. Also, when updating weights, the learning rate was multiplied by <math>1 - p</math>. <math>L</math> denotes the gradient of the loss function.<br />
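The learning-rate and momentum schedule above can be sketched as follows (a hedged reconstruction from the stated hyper-parameters, not the authors' code):<br />

```python
def mnist_schedule(epoch, eps0=10.0, f=0.998,
                   p0=0.5, p_final=0.99, ramp_epochs=500):
    """Hyper-parameter schedule described for the MNIST experiments:
    exponentially decaying learning rate, linearly increasing momentum."""
    eps = eps0 * (f ** epoch)  # multiplied by f at the end of each epoch
    if epoch < ramp_epochs:
        p = p0 + (p_final - p0) * epoch / ramp_epochs
    else:
        p = p_final
    # When updating weights, the learning rate is multiplied by (1 - p).
    return eps, p, eps * (1.0 - p)
```
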
<br />
[[File:weights_mnist2.png|center|400px]]<br />
<br />
The best published result for a standard feedforward neural network was 160 errors. This was reduced to about 130 errors with 50% dropout and separate L2 constraints on the incoming weights of each hidden unit. By also omitting a random 20% of the input pixels, the number of errors was further reduced to 110. The following figure visualizes the result.<br />
[[File:mnist_figure.png|center|500px]]<br />
A publicly available pre-trained deep belief net resulted in 118 errors, which was reduced to 92 errors when the model was fine-tuned with dropout. Another publicly available model, a deep Boltzmann machine, resulted in 103, 97, 94, 93 and 88 errors when fine-tuned using standard backpropagation after being unrolled; these were reduced to 83, 79, 78, 78, and 77 when fine-tuned with dropout – the mean of 79 errors was a record for models that do not use prior knowledge or enhanced training sets.<br />
<br />
= TIMIT = <br />
<br />
The TIMIT dataset includes voice samples of 630 American English speakers across 8 different dialects. It is often used to evaluate the performance of automatic speech recognition systems. Using Kaldi, the dataset was pre-processed to extract input features in the form of log filter bank responses.<br />
<br />
=== Pre-training and Training ===<br />
<br />
For pre-training, they used a deep belief network whose first layer was a Gaussian Restricted Boltzmann Machine (RBM). Visible biases were initialized to zero, and weights were sampled from a normal distribution <math>N(0, 0.01)</math>. Each visible neuron’s variance was set to 1.0 and remained unchanged.<br />
<br />
Learning was carried out by minimizing Contrastive Divergence (CD). Momentum was used to speed up learning: it was initially set to 0.5 and increased linearly to 0.9 over 20 epochs. A learning rate of 0.001 was applied to the average gradient, which was then multiplied by <math>(1-momentum)</math>; L2 weight decay was set to 0.001. With these hyperparameters, training finished after 100 epochs. Binary RBMs were used for training all subsequent layers with a learning rate of 0.01. Then, <math>p</math> was set to the mean activation of a neuron in the data set, and the visible bias of each neuron was initialized to <math>\log(p/(1-p))</math>. Each layer was trained for 50 epochs, and all remaining hyper-parameters were the same as those for the Gaussian RBM.<br />
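The visible-bias initialization mentioned above can be checked with a tiny sketch: choosing the bias as <math>log(p/(1 − p))</math> makes the unit's sigmoid reproduce the mean activation <math>p</math> (illustrative only):<br />

```python
import math

def visible_bias(mean_activation):
    """Initialize an RBM visible bias from the unit's mean activation p,
    so that sigmoid(bias) = p, i.e. sigmoid(log(p / (1 - p))) = p."""
    p = mean_activation
    return math.log(p / (1.0 - p))
```
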
<br />
=== Dropout tuning ===<br />
<br />
The initial weights of the neural network were set from the pretrained RBMs. To fine-tune the network with dropout-backpropagation, momentum was initially set to 0.5 and increased linearly to 0.9 over 10 epochs. The model had a small constant learning rate of 1.0, applied to the average gradient on a minibatch. All other hyperparameters were the same as for MNIST dropout fine-tuning. The model required approximately 200 epochs to converge. For comparison purposes, they also fine-tuned the same network with standard backpropagation, using a learning rate of 0.1 and the same hyperparameters.<br />
<br />
=== Classification Test and Performance ===<br />
<br />
A neural network was constructed to measure the classification error rate on the TIMIT test set. It had four fully-connected hidden layers with 4000 neurons per layer, and an output layer of 185 softmax neurons that are merged into 39 distinct classes. After constructing the neural network, 21 adjacent frames with an advance of 10 ms per frame were given as input.<br />
<br />
Comparing the performance of dropout with standard backpropagation on several network architectures and input representations, dropout consistently achieved lower error and cross-entropy. Results showed that it significantly controls overfitting, making the method robust to choices of network architecture. It also allowed much larger nets to be trained and removed the need for early stopping. Thus, neural network architectures with dropout are not very sensitive to the choice of learning rate and momentum.<br />
<br />
= Reuters Corpus Volume =<br />
Reuters Corpus Volume I archives 804,414 news documents belonging to 103 topics. Under four major themes – corporate/industrial, economics, government/social, and markets – they belonged to 63 classes. After removing 11 classes with no data and one class with insufficient data, 50 classes and 402,738 documents remained. The documents were divided equally and randomly into training and test sets, with each document represented by the 2000 most frequent words in the dataset, excluding stopwords.<br />
<br />
They trained two neural networks of size 2000-2000-1000-50, one using dropout and backpropagation and the other using standard backpropagation. The training hyperparameters were the same as those for MNIST, but training was done for 500 epochs.<br />
<br />
In the following figure, we see significant improvements in test-set error from the model with dropout. On the right side, we see that learning with dropout also proceeds more smoothly. <br />
<br />
[[File:reuters_figure.png|700px|center]]<br />
<br />
= CNN =<br />
<br />
Feed-forward neural networks consist of several layers of neurons where each neuron in a layer applies a linear filter to the input image data and passes the result on to the neurons in the next layer. When calculating a neuron’s output, a scalar bias is added to the filter response and a nonlinear activation function is applied; the filter weights and biases are parameters of the network that are learned from training data. [[File:cnnbigpicture.jpeg|thumb|upright=2|center|alt=text|Figure: Overview of Convolutional Neural Network]] There are several differences between convolutional neural networks and ordinary neural networks. The figure above gives a visual representation of a convolutional neural network. First, a CNN’s neurons are organized topographically into banks and laid out on a 2D grid, so the organization reflects the dimensions of the input data. Second, neurons in a CNN apply filters which are local and centered at the neuron’s location in the topographic organization; useful clues for identifying the object in an input image can be found by examining local neighborhoods of the image. Third, all neurons in a bank apply the same filter at different locations in the input image. In the image example, green is the input to one neuron bank, yellow is the filter bank, and pink is the output of one neuron bank (the convolved feature). A bank of neurons in a CNN applies a convolution operation (a filter) to its input; a single layer in a CNN typically has multiple banks of neurons, each performing a convolution with a different filter. The resulting neuron banks become distinct input channels into the next layer. This weight sharing reduces the net’s representational capacity, but also reduces its capacity to overfit.<br />
[[File:bankofneurons.gif|thumb|upright=3|center|alt=text|Figure: Bank of neurons]]<br />
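A minimal sketch of what one bank computes (technically cross-correlation, since the kernel is not flipped; stride 1 and 'valid' borders are simplifying assumptions):<br />

```python
def convolve2d_valid(image, kernel):
    """Apply one filter of a bank at every location of a 2D input
    ('valid' borders, stride 1): every output neuron uses the same
    weights, as described for CNN neuron banks above."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            s = sum(image[r + i][c + j] * kernel[i][j]
                    for i in range(kh) for j in range(kw))
            row.append(s)
        out.append(row)
    return out
```
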
<br />
=== Pooling ===<br />
<br />
A pooling layer summarizes the activities of local patches of neurons in the convolutional layer by subsampling its output. Pooling is useful for extracting dominant features and for decreasing the computational power required to process the data through dimensionality reduction. The procedure works as follows: the output from the convolutional layer is divided into sections called pooling units, which are laid out topographically and connected to a local neighborhood of other pooling units from the same convolutional output. Each pooling unit then computes some function of its neighborhood, typically the maximum or the average. Max pooling returns the maximum value from the section of the image covered by the pooling unit, while average pooling returns the average of all the values inside the pooling unit (see example). As a result, there are fewer total pooling units than convolutional unit outputs from the previous layer, due to the larger spacing between pixels on pooling layers. Using the max-pooling function reduces the effect of outliers and improves generalization. In overlapping pooling, the spacing between pooling units (usually referred to as the stride) is smaller than the size of the neighborhood that each pooling unit summarizes. With this variant, the pooling layer produces a coarse coding of the outputs, which helps generalization. <br />
[[File:maxandavgpooling.jpeg|thumb|upright=2|center|alt=text|Figure: Max pooling and Average pooling]]<br />
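The two pooling functions described above can be sketched as follows (a minimal illustration, not the authors' code; with a stride of 2 and a 3×3 neighborhood this is the overlapping variant later used for CIFAR-10):<br />

```python
def pool2d(x, size=3, stride=2, mode="max"):
    """Overlapping pooling: summarize each size x size neighborhood
    with the given stride (stride < size means neighborhoods overlap)."""
    h, w = len(x), len(x[0])
    out = []
    for r in range(0, h - size + 1, stride):
        row = []
        for c in range(0, w - size + 1, stride):
            window = [x[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out
```
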
<br />
=== Local Response Normalization === <br />
<br />
This network includes local response normalization layers, which are implemented in lateral form and used on neurons with unbounded activations; they permit the detection of high-frequency features with a big neuron response. This regularizer encourages competition among neurons belonging to different banks. Normalization is done by dividing the activity of a neuron in bank <math>i</math> at position <math>(x,y)</math> by the equation:<br />
[[File:local response norm.png|upright=2|center|]] where the sum runs over <math>N</math> ‘adjacent’ banks of neurons at the same position in the topographic organization of neuron banks. The constants <math>N</math>, <math>\alpha</math> and <math>\beta</math> are hyper-parameters whose values are determined using a validation set. This technique has since been replaced by better techniques such as the combination of dropout and regularization methods (<math>L1</math> and <math>L2</math>).<br />
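A hedged sketch of this normalization across banks, in the AlexNet-style form; the additive constant <math>k</math> and the exact placement of the window of <math>N</math> banks are assumptions here, since the precise formula is given by the figure above:<br />

```python
def local_response_norm(acts, n=9, alpha=0.001, beta=0.75, k=1.0):
    """Sketch of local response normalization across banks: the activity
    of bank i is divided by a power of the summed squared activities of
    the n 'adjacent' banks at the same position. k = 1.0 is an assumed
    constant, not taken from the paper."""
    out = []
    num = len(acts)
    half = n // 2
    for i in range(num):
        lo, hi = max(0, i - half), min(num, i + half + 1)
        s = sum(acts[j] ** 2 for j in range(lo, hi))
        out.append(acts[i] / (k + alpha * s) ** beta)
    return out
```
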
<br />
=== Neuron nonlinearities ===<br />
<br />
All of the neurons in this model use the max-with-zero nonlinearity, where the output of a neuron is computed as <math> a^{i}_{x,y} = max(0, z^i_{x,y})</math>, where <math> z^i_{x,y} </math> is the total input to the neuron. This nonlinearity has several advantages over traditional saturating neuron models, such as a significant reduction in the training time required to reach a given error rate. Another advantage is that it reduces the need for contrast normalization and data pre-processing, since neurons do not saturate – activities simply scale with large input values. As the model’s only pre-processing step, they subtract the mean activity from each pixel, producing centered data.<br />
<br />
=== Objective function ===<br />
<br />
The objective function of their network maximizes the multinomial logistic regression objective which is the same as minimizing the average cross-entropy across training cases between the true label and the model’s predicted label.<br />
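A minimal sketch of this objective for a single training case (illustrative only, not the authors' code):<br />

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_index):
    """Cross-entropy for one training case: the negative log probability
    the softmax assigns to the true label. Minimizing it is the same as
    maximizing the multinomial logistic regression objective."""
    probs = softmax(logits)
    return -math.log(probs[true_index])
```
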
<br />
=== Weight Initialization === <br />
<br />
It’s important to note that if a neuron always receives a negative input during training, it will not learn, because its output is uniformly zero under the max-with-zero nonlinearity. Hence, the weights in their model were sampled from a zero-mean normal distribution with a high enough variance: high variance in the weights ensures that a certain number of neurons receive positive values so that learning can happen. In practice, it is necessary to try several candidate variances until a working initialization is found. In their experiments, setting the biases of the neurons in the hidden layers to a positive constant (1) was helpful in finding one.<br />
<br />
=== Training ===<br />
<br />
In this model, with a batch size of 128 samples and momentum of 0.9, the network is trained using stochastic gradient descent. The update rule for a weight <math>w</math> is $$ v_{i+1} = 0.9v_i - \epsilon \left\langle \frac{\partial E}{\partial w} \right\rangle_i$$ $$w_{i+1} = w_i + v_{i+1} $$ where <math>i</math> is the iteration index, <math>v</math> is a momentum variable, <math>\epsilon</math> is the learning rate and <math>\left\langle \frac{\partial E}{\partial w} \right\rangle_i</math> is the average over the <math>i</math>th batch of the derivative of the objective with respect to <math>w_i</math>; the minus sign makes the step decrease the error <math>E</math>. The whole training process on CIFAR-10 takes roughly 90 minutes; ImageNet takes four days with dropout and two days without.<br />
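A sketch of one update of this rule (written with a minus sign on the gradient term so that the step decreases the error; the gradient argument stands for the mini-batch average):<br />

```python
def momentum_step(w, v, grad, eps, momentum=0.9):
    """One momentum SGD update: the velocity accumulates a decayed
    history of averaged gradients, and the weights move by the velocity."""
    v_next = [momentum * vi - eps * gi for vi, gi in zip(v, grad)]
    w_next = [wi + vi for wi, vi in zip(w, v_next)]
    return w_next, v_next
```
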
<br />
=== Learning ===<br />
To determine the learning rate for the network, one starts with an equal learning rate for each layer and searches over powers of ten for the value that produces the largest reduction in the objective function; usually it is on the order of <math>10^{-2}</math> or <math>10^{-3}</math>. In this case, they reduce the learning rate twice by a factor of ten before terminating training.<br />
<br />
= CIFAR-10 =<br />
<br />
=== CIFAR-10 Dataset ===<br />
<br />
The CIFAR-10 dataset is a subset of the Tiny Images dataset with 10 classes, with incorrectly labeled images removed. It contains 5000 training images and 1000 testing images per class. The dataset consists of 32 x 32 color images retrieved from the web, each labeled with the noun used to search for it.<br />
<br />
[[File:CIFAR-10.png|thumb|upright=2|center|alt=text|Figure 4: CIFAR-10 Sample Dataset]]<br />
<br />
=== Models for CIFAR-10 ===<br />
<br />
Two models, one with dropout and one without, were built to test the performance of dropout on CIFAR-10. Both models are CNNs with three convolutional layers, each followed by a pooling layer. All of the pooling layers use a stride of 2 and summarize a 3×3 neighborhood. The pooling layer that follows the first convolutional layer performs max-pooling, while the remaining two pooling layers perform average-pooling. The first and second pooling layers are followed by response normalization layers with <math>N = 9</math>, <math>\alpha = 0.001</math>, and <math>\beta = 0.75</math>. A ten-unit softmax layer, used to output a probability distribution over class labels, is connected to the upper-most pooling layer. All convolutional layers have 64 filter banks with a filter size of 5×5.<br />
<br />
Additional changes were made with the model with dropout. The model with dropout enables us to use more parameters because dropout forces a strong regularization on the network. Thus, a fourth weight layer is added to take the input from the previous pooling layer. This fourth weight layer is locally connected, but not convolutional, and contains 16 banks of filters of size 3 × 3 with 50% dropout. Lastly, the softmax layer takes its input from this fourth weight layer.<br />
<br />
Thus, with a neural network with 3 convolutional hidden layers and 3 max-pooling layers, the classification error reached 16.6%, beating the best previously published error rate of 18.5% obtained without transformed data. The model with one additional locally-connected layer and dropout at the last hidden layer produced an error rate of 15.6%.<br />
<br />
= ImageNet =<br />
<br />
===ImageNet Dataset===<br />
<br />
ImageNet is a dataset of millions of high-resolution images labeled among 1000 different categories. The data were collected from the web and manually labeled using Amazon's Mechanical Turk crowd-sourcing tool.<br />
Because this dataset has millions of labeled images in thousands of categories, it is very difficult to achieve perfect accuracy, even for humans, since ImageNet images may contain multiple objects and there is a large number of object classes. ImageNet and CIFAR-10 are very similar, but the scale of ImageNet is about 20 times bigger (1,300,000 vs 60,000 training images). ImageNet contains about 1.3 million training images, 50,000 validation images, and 150,000 testing images. They used images resized to 256 x 256 pixels for their experiments.<br />
<br />
'''An ambiguous example to classify:'''<br />
<br />
[[File:imagenet1.png|200px|center]]<br />
<br />
When this paper was written, the best published result on this dataset was an error rate of 45.7%, from "High-dimensional signature compression for large-scale image classification" (J. Sanchez, F. Perronnin, CVPR 2011). The authors of this paper achieved a comparable error rate of 48.6% using a single neural network with five convolutional hidden layers, interleaved with max-pooling layers, followed by two globally connected layers and a final 1000-way softmax layer. Applying 50% dropout to the 6th layer brought the error rate down to 42.4%.<br />
<br />
'''ImageNet Dataset:'''<br />
<br />
[[File:imagenet2.png|400px|center]]<br />
<br />
===Models for ImageNet===<br />
<br />
They mostly focused on the model with dropout; the one without dropout used a similar approach but had a serious overfitting issue. They trained a convolutional neural network on 224×224 patches randomly extracted from the 256×256 images; this data augmentation reduced the network’s capacity to overfit the training data and helped generalization. At test time, the network's predictions were averaged over ten 224×224 patches of each 256×256 input image: the center patch, the four corner patches, and their horizontal reflections. This complicated network architecture was used to maximize performance on the validation set, and dropout was found to be very effective. It was also demonstrated that non-convolutional higher layers with a large number of parameters worked well with dropout, but had a negative impact on performance without dropout.<br />
<br />
The network contains seven weight layers: the first five are convolutional, and the last two are globally connected. Max-pooling layers follow layers 1, 2, and 5. The output of the last globally connected layer is fed to a 1000-way softmax output layer. Using this architecture, the authors achieved an error rate of 48.6%; applying 50% dropout to the 6th layer brought the error rate down to 42.4%.<br />
<br />
<br />
[[File:modelh2.png|700px|center]] <br />
<br />
[[File:layer2.png|600px|center]]<br />
<br />
As with the previous datasets – MNIST, TIMIT, Reuters, and CIFAR-10 – we see a significant improvement on the ImageNet dataset. Even with complicated architectures like this one, introducing dropout makes models generalize better and gives lower test error rates.<br />
<br />
= Conclusion =<br />
<br />
The authors have shown a consistent improvement by the models trained with dropout in classifying objects in the following datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.<br />
<br />
= Critiques =<br />
It is a very brilliant idea to drop out half of the neurons to reduce co-adaptations. It is mentioned that for fully connected layers, dropout in all hidden layers works better than dropout in only one hidden layer. Another paper, Dropout: A Simple Way to Prevent Neural Networks from Overfitting [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf], gives a more detailed explanation.<br />
<br />
It will be interesting to see how this paper could be used to prevent overfitting of LSTMs.<br />
<br />
Firstly, classification with the "dropout" CNN method (omitting neurons in hidden layers) is a very interesting topic. If the authors could briefly explain the theoretical advantages of this method for processing image data, it would be easier for readers to understand. A discussion of how to deal with the overfitting issue would also be valuable.<br />
<br />
The authors mention that they tried various dropout probabilities and that the majority of them improved the model's generalization performance, but that more extreme probabilities tended to be worse which is why a dropout rate of 50% was used in the paper. The authors further develop this point to mention that the method can be improved by adapting individual dropout probabilities of each hidden or input unit using validation tests. This would be an interesting area to further develop and explore, as using a hardcoded 50% dropout for all layers might not be the optimal choice for all CNN applications. It would have been interesting to see the results of their investigations of differing dropout rates.<br />
<br />
The authors don't explain that during training, at each layer where dropout is applied, the values must be scaled by 1/(1-p), where p is the dropout rate – this way the expected value of the layer is the same at both train and test time. They may have considered another solution for this discrepancy at the time (it is an old paper), but it doesn't seem like any solution was presented here. <br />
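A sketch of the alternative this critique describes, with <math>p</math> taken to be the probability of dropping a unit, so the train-time scale factor is <math>1/(1-p)</math> (illustrative only):<br />

```python
import random

def inverted_dropout(activations, drop_rate=0.5, train=True):
    """'Inverted dropout': scale the kept activations by 1/(1 - drop_rate)
    at train time, so that the expected activation already matches test
    time and no scaling is needed at test time."""
    if not train:
        return list(activations)
    keep = 1.0 - drop_rate
    return [0.0 if random.random() < drop_rate else a / keep
            for a in activations]
```
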
<br />
Despite the advantages of using dropout to prevent overfitting and reduce test errors, the authors did not discuss its effect on training time much. In another [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf paper] published a few years later by the same authors, there is more discussion of this: it appears that dropout increases training time by 2-3 times compared to a standard NN with the same architecture, which is a drawback worth mentioning. <br />
<br />
== Reference ==<br />
[1] N. Srivastava, "Dropout: a simple way to prevent neural networks from overfitting", The Journal of Machine Learning Research, Jan 2014.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectors&diff=49041Improving neural networks by preventing co-adaption of feature detectors2020-12-03T23:01:19Z<p>Z42qin: /* Drop Out Model */</p>
<hr />
<div>== Presented by ==<br />
Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim<br />
<br />
= Drop Out Model =<br />
In this paper, Hinton et al. introduce a novel way to improve the performance of neural networks, particularly when a large feedforward network is trained on a small training set, which causes poor generalization and leads to overfitting. This problem can be reduced by randomly omitting half of the feature detectors on each training case. By omitting neurons in hidden layers with a probability of 0.5, each hidden unit is prevented from relying on other hidden units being present during training, so there are fewer co-adaptations among them on the training data. Called “dropout,” this process is also an efficient alternative to training many separate networks and averaging their predictions on the test set. <br />
<br />
The intuition for dropout is that if neurons are randomly dropped during training, they can no longer rely on their neighbours, so each neuron must become more robust. Another interpretation is that dropout is similar to training an ensemble of models, since each training pass with a different set of randomly dropped neurons can be viewed as its own model. <br />
<br />
They used the standard stochastic gradient descent algorithm with the training data separated into mini-batches. An upper bound was set on the L2 norm of the incoming weight vector for each hidden neuron; whenever an update caused a norm to exceed the bound, the weight vector was renormalized back onto it. They found that using a constraint, instead of a penalty, forced the model to do a more thorough search of the weight space when coupled with a very large learning rate that decays during training.<br />
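This max-norm renormalization can be sketched as follows (a minimal illustration, not the authors' code; the function name and the bound of 15 used for MNIST below are for illustration):<br />

```python
import numpy as np

def apply_max_norm(W, max_norm=15.0):
    """Renormalize each hidden unit's incoming weight vector (a column of W)
    back onto the bound whenever its L2 norm exceeds max_norm."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)   # one norm per hidden unit
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale

# after an SGD step, weights that grew too large are projected back:
rng = np.random.default_rng(0)
W = rng.normal(size=(784, 800)) * 5.0   # deliberately large weights
W = apply_max_norm(W, max_norm=15.0)
```

Applied after every mini-batch update, this keeps each hidden unit's incoming weight vector inside an L2 ball of radius 15 without penalizing weights that are already inside it.<br />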
<br />
= Mean Network =<br />
Their dropout models included all of the hidden neurons at test time, with their outgoing weights halved to compensate for the fact that twice as many neurons are active as during training. This is called a 'mean network'. Its prediction is similar to taking the geometric mean of the probability distributions predicted by all <math>2^N</math> possible dropout networks. Because of this model averaging, the correct answers get a higher log probability than under an individual dropout network, which also leads to a lower squared error for the network. <br />
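For the special case of a softmax output layer with dropout applied only to the final hidden layer, the geometric-mean claim can be made exact (a sketch, following the exact model-averaging argument later spelled out in the JMLR dropout paper):<br />

```latex
% A dropout network with mask m in {0,1}^N on the last hidden layer h predicts
\[
  p_m(y) \;=\; \frac{\exp\!\big(w_y^\top (m \circ h)\big)}
                    {\sum_{y'} \exp\!\big(w_{y'}^\top (m \circ h)\big)}
\]
% The normalized geometric mean over all 2^N equally likely masks is
\[
  \operatorname{GM}(y) \;\propto\; \Big(\prod_m p_m(y)\Big)^{2^{-N}}
  \;\propto\; \exp\!\Big(2^{-N}\sum_m w_y^\top (m \circ h)\Big)
  \;=\; \exp\!\big(\tfrac{1}{2}\,w_y^\top h\big),
\]
% since each h_j is kept in exactly half of the masks. This is exactly the
% softmax computed by the mean network, i.e. the full net with halved weights.
```

For networks with dropout in several hidden layers, this equality no longer holds exactly and the mean network is only an approximation to the geometric mean.<br />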
<br />
<br />
The models were shown to result in lower test error rates on several datasets: MNIST; TIMIT; Reuters Corpus Volume; CIFAR-10; and ImageNet.<br />
<br />
= MNIST =<br />
The MNIST dataset contains 70,000 digit images of size 28 x 28. To see the impact of dropout, they used 4 different neural networks (784-800-800-10, 784-1200-1200-10, 784-2000-2000-10, 784-1200-1200-1200-10), using the same dropout rates: 50% for hidden neurons and 20% for visible neurons. Stochastic gradient descent was used with mini-batches of size 100 and a cross-entropy objective as the loss function. Weights were updated after each minibatch, and training was done for 3000 epochs. An exponentially decaying learning rate <math>\epsilon</math> was used, with the initial value set to 10.0 and multiplied by the decay factor <math>f</math> = 0.998 at the end of each epoch. At each hidden layer, the incoming weight vector of each hidden neuron was constrained to a maximum length <math>l</math>, and they found from cross-validation that the results were best when <math>l</math> = 15. Initial weight values were sampled from a normal distribution with mean 0 and standard deviation 0.01. To update weights, an additional variable <math>p</math>, called momentum, was used to accelerate learning. The initial value of <math>p</math> was 0.5, and it increased linearly to the final value 0.99 during the first 500 epochs, remaining unchanged afterwards. Also, when updating weights, the learning rate was multiplied by <math>1 - p</math>. In the update rule shown below, <math>L</math> denotes the loss function.<br />
<br />
[[File:weights_mnist2.png|center|400px]]<br />
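The schedules above can be sketched as follows (a minimal illustration, not the authors' code; the exact way the <math>1 - p</math> factor enters the step and the sign convention for the gradient are assumptions):<br />

```python
import numpy as np

eps0, f = 10.0, 0.998              # initial learning rate, per-epoch decay factor
p0, p_final, ramp = 0.5, 0.99, 500 # momentum ramps linearly over 500 epochs

def learning_rate(epoch):
    return eps0 * f ** epoch       # exponential decay

def momentum(epoch):
    if epoch < ramp:
        return p0 + (p_final - p0) * epoch / ramp
    return p_final

def sgd_momentum_step(w, v, grad_L, epoch):
    """One update: the velocity accumulates the gradient, and the step size
    is the decayed learning rate scaled by (1 - p), as described above."""
    p = momentum(epoch)
    eps = learning_rate(epoch) * (1.0 - p)
    v = p * v - eps * grad_L
    return w + v, v

w, v = sgd_momentum_step(np.zeros(2), np.zeros(2), np.ones(2), epoch=0)
```

Scaling the step by <math>1 - p</math> keeps the effective step size roughly constant as the momentum coefficient grows.<br />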
<br />
The best published result for a standard feedforward neural network was 160 errors. This was reduced to about 130 errors with 0.5 dropout and separate L2 constraints on the incoming weights of each hidden unit. By omitting a random 20% of the input pixels in addition to the aforementioned changes, the number of errors was further reduced to 110. The following figure visualizes the result.<br />
[[File:mnist_figure.png|center|500px]]<br />
A publicly available pre-trained deep belief net resulted in 118 errors, and it was reduced to 92 errors when the model was fine-tuned with dropout. Another publicly available model was a deep Boltzmann machine, and it resulted in 103, 97, 94, 93 and 88 when the model was fine-tuned using standard backpropagation and was unrolled. They were reduced to 83, 79, 78, 78, and 77 when the model was fine-tuned with dropout – the mean of 79 errors was a record for models that do not use prior knowledge or enhanced training sets.<br />
<br />
= TIMIT = <br />
<br />
The TIMIT dataset includes voice samples of 630 American English speakers across 8 different dialects. It is often used to evaluate the performance of automatic speech recognition systems. Using Kaldi, the dataset was pre-processed to extract input features in the form of log filter bank responses.<br />
<br />
=== Pre-training and Training ===<br />
<br />
For pretraining, the neural network was trained as a deep belief network, with the first layer built as a Restricted Boltzmann Machine (RBM) with Gaussian visible units. Visible biases were initialized to zero, and weights were sampled from a normal distribution <math>N(0, 0.01)</math>. Each visible neuron’s variance was set to 1.0 and remained unchanged.<br />
<br />
Learning was performed with Contrastive Divergence (CD). Momentum was used to speed up learning: it was initially set to 0.5 and increased linearly to 0.9 over 20 epochs. A learning rate of 0.001 was applied to the average gradient and then multiplied by <math>(1-momentum)</math>, and L2 weight decay was set to 0.001. With these hyperparameters, training ran for 100 epochs. Binary RBMs were used for training all subsequent layers with a learning rate of 0.01. Letting <math>p</math> be the mean activation of a neuron in the data set, the visible bias of each neuron was initialized to <math>log(p/(1 − p))</math>. Each layer was trained for 50 epochs, with all remaining hyper-parameters the same as those for the Gaussian RBM.<br />
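A minimal sketch of one CD-1 update and of the <math>log(p/(1-p))</math> visible-bias initialization, for the binary-binary case (illustrative names and toy data; the momentum and weight-decay terms described above are omitted for brevity):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# toy binary data: 100 samples, 6 visible units
data = (rng.random((100, 6)) < 0.3).astype(float)

n_vis, n_hid = 6, 4
W = rng.normal(0.0, 0.01, (n_vis, n_hid))
b_hid = np.zeros(n_hid)

# visible bias initialized to log(p / (1 - p)), p = mean activation in the data
p = np.clip(data.mean(axis=0), 1e-3, 1 - 1e-3)
b_vis = np.log(p / (1.0 - p))

def cd1_grads(v0):
    """One step of CD-1: positive phase from the data, negative phase
    from a single Gibbs reconstruction."""
    ph0 = sigmoid(v0 @ W + b_hid)                  # hidden probabilities
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b_vis)                # reconstruction
    ph1 = sigmoid(pv1 @ W + b_hid)
    dW = (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    db_vis = (v0 - pv1).mean(axis=0)
    db_hid = (ph0 - ph1).mean(axis=0)
    return dW, db_vis, db_hid

dW, db_vis, db_hid = cd1_grads(data)
W += 0.01 * dW                                     # learning rate 0.01 as in the text
```

The bias initialization makes the RBM's initial visible activations match the data's mean activations before any weight learning happens.<br />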
<br />
=== Dropout tuning ===<br />
<br />
The initial weights of the neural network were set from the pretrained RBMs. To finetune the network with dropout-backpropagation, momentum was initially set to 0.5 and increased linearly to 0.9 over 10 epochs. A constant learning rate of 1.0 was applied to the average gradient on each minibatch. All other hyperparameters were the same as for MNIST dropout finetuning. The model required approximately 200 epochs to converge. For comparison, they also finetuned the same network with standard backpropagation, using a learning rate of 0.1 and otherwise identical hyperparameters.<br />
<br />
=== Classification Test and Performance ===<br />
<br />
A neural network was constructed to measure the classification error rate on the test set of the TIMIT dataset. It had four fully-connected hidden layers with 4000 neurons per layer, and an output layer of 185 softmax neurons that are merged into 39 distinct classes. As input, the network was given 21 adjacent frames with an advance of 10 ms per frame.<br />
<br />
Comparing the performance of dropout with standard backpropagation on several network architectures and input representations, dropout consistently achieved lower error and cross-entropy. Results showed that it significantly controls overfitting, making the method robust to choices of network architecture. It also allowed much larger nets to be trained and removed the need for early stopping. Thus, neural network architectures with dropout are not very sensitive to the choice of learning rate and momentum.<br />
<br />
= Reuters Corpus Volume =<br />
Reuters Corpus Volume I archives 804,414 news documents belonging to 103 topics. Under four major themes - corporate/industrial, economics, government/social, and markets - the documents belonged to 63 classes. After removing 11 classes with no data and one class with insufficient data, 50 classes and 402,738 documents remain. The documents were divided equally and randomly into training and test sets, with each document represented by the 2000 most frequent words in the dataset, excluding stopwords.<br />
<br />
They trained two neural networks, with size 2000-2000-1000-50, one using dropout and backpropagation, and the other using standard backpropagation. The training hyperparameters are the same as that in MNIST, but training was done for 500 epochs.<br />
<br />
In the following figure, we see significant improvements in test set error from the model with dropout. On the right side, we see that learning with dropout also proceeds more smoothly. <br />
<br />
[[File:reuters_figure.png|700px|center]]<br />
<br />
= CNN =<br />
<br />
Feed-forward neural networks consist of several layers of neurons, where each neuron applies a linear filter to its input and passes the result on to the neurons in the next layer. When calculating a neuron’s output, a scalar bias is added and a nonlinear activation function is applied; the filter weights and biases are the parameters of the network learned from training data. [[File:cnnbigpicture.jpeg|thumb|upright=2|center|alt=text|Figure: Overview of Convolutional Neural Network]] There are several differences between Convolutional Neural Networks and ordinary neural networks; the figure above gives a visual overview of a CNN. First, a CNN’s neurons are organized topographically into banks laid out on a 2D grid, reflecting the organization of the dimensions of the input data. Secondly, neurons in a CNN apply filters which are local and centered at the neuron’s location in the topographic organization, reflecting the fact that useful clues for identifying an object in an input image can be found by examining local neighborhoods of the image. Next, all neurons in a bank apply the same filter at different locations in the input image. In the image example, green is the input to one neuron bank, yellow is the filter bank, and pink is the output of one neuron bank (the convolved feature). A bank of neurons in a CNN thus applies a convolution operation (a filter) to its input, and a single layer in a CNN typically has multiple banks of neurons, each performing a convolution with a different filter. The resulting neuron banks become distinct input channels into the next layer. This weight sharing reduces the net’s representational capacity, but also reduces its capacity to overfit.<br />
[[File:bankofneurons.gif|thumb|upright=3|center|alt=text|Figure: Bank of neurons]]<br />
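The weight sharing described above, where one bank of neurons applies the same filter at every location, can be sketched as (a minimal 'valid' correlation, not the authors' code):<br />

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Apply one shared filter at every location of a 2-D input
    ('valid' correlation, no padding), as one bank of CNN neurons does."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

img = np.array([[1., 2., 3.],
                [4., 5., 6.],
                [7., 8., 9.]])
feat = conv2d_valid(img, np.ones((2, 2)))   # every output neuron uses the same filter
```

Because the same 2 × 2 kernel is reused at every position, this bank has only 4 weights no matter how large the input image is.<br />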
<br />
=== Pooling ===<br />
<br />
A pooling layer summarizes the activities of local patches of neurons in a convolutional layer by subsampling its output. Pooling is useful for extracting dominant features and for decreasing the computational power required to process the data, through dimensionality reduction. Pooling works as follows: the output of a convolutional layer is divided into sections called pooling units, laid out topographically, each connected to a local neighborhood of the convolutional output. Each pooling unit then computes some function of its patch, typically the maximum or the average. Max pooling returns the maximum value from the section of the image covered by the pooling unit, while average pooling returns the average of all the values inside the pooling unit (see example). As a result, there are fewer pooling units than convolutional outputs from the previous layer, due to the spacing between pooling units (usually referred to as the stride). Using the max-pooling function reduces the effect of outliers and improves generalization. In overlapping pooling, the stride is made smaller than the size of the neighborhood that the pooling units summarize; with this variant, the pooling layer produces a coarse coding of the outputs, which helps generalization. <br />
[[File:maxandavgpooling.jpeg|thumb|upright=2|center|alt=text|Figure: Max pooling and Average pooling]]<br />
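Max and average pooling can be sketched as follows (a minimal illustration; setting the stride smaller than the pooling size gives the overlapping variant):<br />

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Summarize local patches of a 2-D feature map by max or average pooling."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    op = np.max if mode == "max" else np.mean
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = op(patch)
    return out

x = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
# pool2d(x) summarizes non-overlapping 2x2 patches;
# pool2d(x, size=3, stride=2) gives overlapping pooling.
```

Note how the 4 × 4 input is reduced to a 2 × 2 summary: max pooling keeps the strongest response in each patch, average pooling keeps the mean.<br />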
<br />
=== Local Response Normalization === <br />
<br />
This network includes local response normalization layers, which are implemented in lateral form and used on neurons with unbounded activations; they permit the detection of high-frequency features with a large neuron response. This regularizer encourages competition among neurons belonging to different banks. Normalization is done by dividing the activity of a neuron in bank <math>i</math> at position <math>(x,y)</math> by the expression:<br />
[[File:local response norm.png|upright=2|center|]] where the sum runs over <math>N</math> ‘adjacent’ banks of neurons at the same position in the topographic organization of neuron banks. The constants <math>N</math>, <math>\alpha</math> and <math>\beta</math> are hyper-parameters whose values are determined using a validation set. This technique has since been superseded by better techniques, such as the combination of dropout and regularization methods (<math>L1</math> and <math>L2</math>).<br />
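Since the exact formula appears only as an image above, the following sketch uses a common form of across-channel response normalization (with a constant term of 1 in the denominator, which is an assumption; the paper's exact constant may differ), using the <math>N</math>, <math>\alpha</math> and <math>\beta</math> hyper-parameters named in the text:<br />

```python
import numpy as np

def local_response_norm(a, N=9, alpha=0.001, beta=0.75):
    """Across-channel response normalization at one spatial position.
    a: activities of all banks at position (x, y), shape (num_banks,).
    Each activity is divided by a term that grows with the summed squared
    activities of up to N 'adjacent' banks."""
    num_banks = a.shape[0]
    half = N // 2
    out = np.empty_like(a)
    for i in range(num_banks):
        lo, hi = max(0, i - half), min(num_banks, i + half + 1)
        out[i] = a[i] / (1.0 + alpha * np.sum(a[lo:hi] ** 2)) ** beta
    return out
```

A bank whose neighbors are all strongly active gets damped more than an isolated strong response, which is what creates the competition between banks.<br />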
<br />
=== Neuron nonlinearities ===<br />
<br />
All of the neurons in this model use the max-with-zero nonlinearity, where the output of a neuron is computed as <math> a^{i}_{x,y} = max(0, z^i_{x,y})</math>, with <math> z^i_{x,y} </math> the total input to the neuron. This nonlinearity has several advantages over traditional saturating neuron models, such as a significant reduction in the training time required to reach a given error rate. Another advantage is that it reduces the need for contrast normalization and similar data pre-processing, since neurons do not saturate: activities simply scale up gradually with large input values. As the model’s only pre-processing step, the mean activity is subtracted from each pixel, yielding centered data.<br />
<br />
=== Objective function ===<br />
<br />
The objective function of their network maximizes the multinomial logistic regression objective which is the same as minimizing the average cross-entropy across training cases between the true label and the model’s predicted label.<br />
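The equivalence between maximizing the multinomial logistic objective and minimizing the cross-entropy can be illustrated for a single training case (a minimal sketch, not the authors' code):<br />

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y):
    """Cross-entropy for one case with true label y: -log p(y)."""
    return -np.log(softmax(z)[y])

# maximizing log p(true class) is exactly minimizing this loss:
z = np.array([0.0, 0.0])       # two classes, equal logits -> p = [0.5, 0.5]
loss = cross_entropy(z, 0)     # -log(1/2) = log 2
```

Averaging this quantity over the training cases gives the objective described above.<br />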
<br />
=== Weight Initialization === <br />
<br />
It’s important to note that if a neuron always receives a negative total input during training, it will not learn, because its output is uniformly zero under the max-with-zero nonlinearity. Hence, the weights in the model were sampled from a zero-mean normal distribution with a variance high enough that some neurons receive positive inputs and learning can begin. In practice, it is necessary to try several candidate variances until a working initialization is found. In their experiments, initializing the biases of the neurons in the hidden layers to a positive constant such as 1 was helpful in finding one.<br />
<br />
=== Training ===<br />
<br />
The model was trained using stochastic gradient descent with a batch size of 128 samples and a momentum of 0.9. The update rule for a weight <math>w</math> is <math> v_{i+1} = 0.9 v_i + \epsilon \left\langle \frac{\partial E}{\partial w} \right\rangle_i </math> and <math> w_{i+1} = w_i + v_{i+1} </math>, where <math>i</math> is the iteration index, <math>v</math> is a momentum variable, <math>\epsilon</math> is the learning rate, and <math>\left\langle \frac{\partial E}{\partial w} \right\rangle_i</math> is the average over the <math>i</math>th batch of the derivative of the objective with respect to <math>w</math>. The whole training process takes roughly 90 minutes on CIFAR-10; on ImageNet it takes 4 days with dropout and two days without.<br />
<br />
=== Learning ===<br />
To determine the learning rate for the network, they start with an equal learning rate for each layer, chosen among powers of ten as the one producing the largest reduction in the objective function; it is usually on the order of <math>10^{-2}</math> or <math>10^{-3}</math>. The learning rate was then reduced twice by a factor of ten before termination of training.<br />
<br />
= CIFAR-10 =<br />
<br />
=== CIFAR-10 Dataset ===<br />
<br />
The CIFAR-10 dataset is a subset of the Tiny Images dataset with 10 classes, with incorrectly labeled images removed. It contains 5000 training images and 1000 test images per class. The images are 32 x 32 color images gathered from the web, each labeled with the noun used to search for it.<br />
<br />
[[File:CIFAR-10.png|thumb|upright=2|center|alt=text|Figure 4: CIFAR-10 Sample Dataset]]<br />
<br />
=== Models for CIFAR-10 ===<br />
<br />
Two models, one with dropout and one without, were built to test the performance of dropout on CIFAR-10. Both are CNNs with three convolutional layers, each followed by a pooling layer. All of the pooling layers use a stride of 2 and summarize a 3 × 3 neighborhood. The pooling layer following the first convolutional layer performs max-pooling, while the remaining two pooling layers perform average-pooling. Response normalization layers with <math>N = 9, \alpha = 0.001</math>, and <math>\beta = 0.75</math> follow the first and second pooling layers. A ten-unit softmax layer, used to output a probability distribution over class labels, is connected to the upper-most pooling layer. All convolutional layers have 64 filter banks with filters of size 5 × 5.<br />
<br />
Additional changes were made with the model with dropout. The model with dropout enables us to use more parameters because dropout forces a strong regularization on the network. Thus, a fourth weight layer is added to take the input from the previous pooling layer. This fourth weight layer is locally connected, but not convolutional, and contains 16 banks of filters of size 3 × 3 with 50% dropout. Lastly, the softmax layer takes its input from this fourth weight layer.<br />
<br />
With this architecture (a neural network with 3 convolutional hidden layers, each followed by a pooling layer), the classification error reached 16.6%, beating the best published error rate of 18.5% obtained without using transformed data. The model with the additional locally-connected layer and dropout at the last hidden layer produced an error rate of 15.6%.<br />
<br />
= ImageNet =<br />
<br />
===ImageNet Dataset===<br />
<br />
ImageNet is a dataset of millions of high-resolution images labeled among 1000 different categories. The data were collected from the web and manually labeled using Amazon's Mechanical Turk crowd-sourcing tool.<br />
Because this dataset has millions of labeled images across so many categories, it is very difficult to achieve perfect accuracy, even for humans, since ImageNet images may contain multiple objects and there is a large number of object classes. ImageNet and CIFAR-10 are very similar, but the scale of ImageNet is about 20 times bigger (1,300,000 vs 60,000 images). ImageNet contains about 1.3 million training images, 50,000 validation images, and 150,000 testing images. For their experiments, the images were resized to 256 x 256 pixels.<br />
<br />
'''An ambiguous example to classify:'''<br />
<br />
[[File:imagenet1.png|200px|center]]<br />
<br />
When this paper was written, the best score on this dataset was an error rate of 45.7%, by High-dimensional signature compression for large-scale image classification (J. Sanchez, F. Perronnin, CVPR 2011). The authors of this paper achieved a comparable error rate of 48.6% using a single neural network with five convolutional hidden layers, with max-pooling layers after some of them, followed by two globally connected layers and a final 1000-way softmax layer. Applying 50% dropout to the 6th layer brought the error rate down to 42.4%.<br />
<br />
'''ImageNet Dataset:'''<br />
<br />
[[File:imagenet2.png|400px|center]]<br />
<br />
===Models for ImageNet===<br />
<br />
They mostly focused on the model with dropout, since the model without dropout followed a similar approach but had a serious issue with overfitting. They used a convolutional neural network trained on 224 × 224 patches randomly extracted from the 256 × 256 images. This reduces the network’s capacity to overfit the training data and, as a form of data augmentation, helps generalization. For testing, the net's predictions were averaged over ten 224 × 224 patches of the 256 × 256 input image: the center patch, the four corner patches, and their horizontal reflections. To maximize performance on the validation set, this complicated network architecture was used, and dropout was found to be very effective. It was also demonstrated that using non-convolutional higher layers with a large number of parameters worked well with dropout, but hurt performance without it.<br />
<br />
The network contains seven weight layers: the first five are convolutional, and the last two are globally-connected. Max-pooling layers follow layers 1, 2, and 5. The output of the last globally-connected layer is fed to a 1000-way softmax output layer. Using this architecture, the authors achieved an error rate of 48.6%; applying 50% dropout to the 6th layer brought the error rate down to 42.4%.<br />
<br />
<br />
[[File:modelh2.png|700px|center]] <br />
<br />
[[File:layer2.png|600px|center]]<br />
<br />
As with the previous datasets (MNIST, TIMIT, Reuters, and CIFAR-10), we also see a significant improvement on the ImageNet dataset. Even with complicated architectures like this one, introducing dropout makes models generalize better and gives lower test error rates.<br />
<br />
= Conclusion =<br />
<br />
The authors have shown a consistent improvement by the models trained with dropout in classifying objects in the following datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.<br />
<br />
= Critiques =<br />
It is a brilliant idea to drop out half of the neurons to reduce co-adaptations. It is mentioned that for fully connected layers, dropout in all hidden layers works better than dropout in only one hidden layer. Another paper, Dropout: A Simple Way to Prevent Neural Networks from Overfitting [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf], gives a more detailed explanation.<br />
<br />
It will be interesting to see how this paper could be used to prevent overfitting of LSTMs.<br />
<br />
Firstly, classification with the "dropout" method (omitting neurons in hidden layers) is a very interesting topic. If the authors briefly explained, in theory, the advantages of this method for processing image data, it would be easier for readers to understand. A deeper discussion of how the method deals with the overfitting issue would also be valuable.<br />
<br />
The authors mention that they tried various dropout probabilities and that the majority of them improved the model's generalization performance, but that more extreme probabilities tended to be worse, which is why a dropout rate of 50% was used in the paper. The authors further mention that the method could be improved by adapting individual dropout probabilities for each hidden or input unit using validation tests. This would be an interesting area to explore further, as a hardcoded 50% dropout for all layers might not be the optimal choice for all CNN applications. It would have been interesting to see the results of their investigations of differing dropout rates.<br />
<br />
The authors don't explain that, at each layer where dropout is applied during training, the values can be scaled by 1/(1 - p), where p is the dropout rate (equivalently, by 1/p if p denotes the keep probability) - this way the expected value of each layer is the same at train and test time, and no rescaling is needed at test time. They may have considered another solution for this discrepancy at the time (it is an old paper), but no such solution is presented here. <br />
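The scaling discussed above (often called 'inverted dropout') can be sketched as follows, where p denotes the dropout rate (illustrative code, not from the paper):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(x, drop_rate=0.5):
    """Train-time dropout with activations rescaled by 1/(1 - drop_rate),
    so E[output] == x and no rescaling is needed at test time."""
    keep = (rng.random(x.shape) >= drop_rate).astype(x.dtype)
    return x * keep / (1.0 - drop_rate)

x = np.full(100_000, 2.0)
y = inverted_dropout(x, drop_rate=0.5)
# roughly half the entries are 4.0 and half are 0.0, so the mean stays near 2.0
```

This is equivalent in expectation to the paper's scheme of leaving activations unscaled during training and halving the outgoing weights at test time.<br />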
<br />
Despite the advantages of using dropout to prevent overfitting and reduce test errors, the authors did not discuss its effect on the length of training time. In another [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf paper] published a few years later by the same authors, there is more discussion of this: dropout increases training time by 2-3 times compared to a standard NN with the same architecture, a drawback that would have been worth mentioning. <br />
<br />
== Reference ==<br />
[1] N. Srivastava, "Dropout: a simple way to prevent neural networks from overfitting", The Journal of Machine Learning Research, Jan 2014.</div>
<hr />
<div>== Presented by ==<br />
Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim<br />
<br />
= Drop Out Model =<br />
In this paper, Hinton et al. introduce a novel way to improve neural networks’ performance, particularly in the case that a large feedforward neural network is trained on a small training set, which causes poor performance and leads to “overfitting” problem. This problem can be reduced by randomly omitting half of the feature detectors on each training case. In fact, By omitting neurons in hidden layers with a probability of 0.5, each hidden unit is prevented from relying on other hidden units being present during training. Hence there are fewer co-adaptations among them on the training data. Called “dropout,” this process is also an efficient alternative to train many separate networks and average their predictions on the test set. <br />
<br />
The intuition for dropout is that if neurons are randomly dropped during training, they can no longer rely on their neighbours, thus allowing each neutron to become more robust. Another interpretation is that dropout is similar to training an ensemble of models, since each epoch with randomly dropped neurons can be viewed as its own model. <br />
<br />
They used the standard, stochastic gradient descent algorithm and separated training data into mini-batches. An upper bound was set on the L2 norm of incoming weight vector for each hidden neuron, which was normalized if its size exceeds the bound. They found that using a constraint, instead of a penalty, forced model to do a more thorough search of the weight-space, when coupled with the very large learning rate that decays during training. <br />
<br />
= Mean Network =<br />
Their dropout models included all of the hidden neurons, and their outgoing weights were halved to account for the chances of omission. This is called an 'Mean Network'. This is similar to taking the geometric mean of the probability distribution predicted by all 2^N networks. Due to this cumulative addition, the correct answers will have higher log probability than an individual dropout network, which also lead to a lower square error of the network. <br />
<br />
<br />
The models were shown to result in lower test error rates on several datasets: MNIST; TIMIT; Reuters Corpus Volume; CIFAR-10; and ImageNet.<br />
<br />
= MNIST =<br />
The MNIST dataset contains 70,000 digit images of size 28 x 28. To see the impact of dropout, they used 4 different neural networks (784-800-800-10, 784-1200-1200-10, 784-2000-2000-10, 784-1200-1200-1200-10), using the same dropout rates as 50% for hidden neurons and 20% for visible neurons. Stochastic gradient descent was used with mini-batches of size 100 and a cross-entropy objective function as the loss function. Weights were updated after each minibatch, and training was done for 3000 epochs. An exponentially decaying learning rate <math>\epsilon</math> was used, with the initial value set as 10.0, and it was multiplied by the decaying factor <math>f</math> = 0.998 at the end of each epoch. At each hidden layer, the incoming weight vector for each hidden neuron was set an upper bound of its length, <math>l</math>, and they found from cross-validation that the results were the best when <math>l</math> = 15. Initial weights values were pooled from a normal distribution with mean 0 and standard deviation of 0.01. To update weights, an additional variable, ''p'', called momentum, was used to accelerate learning. The initial value of <math>p</math> was 0.5, and it increased linearly to the final value 0.99 during the first 500 epochs, remaining unchanged after. Also, when updating weights, the learning rate was multiplied by <math>1 – p</math>. <math>L</math> denotes the gradient of loss function.<br />
<br />
[[File:weights_mnist2.png|center|400px]]<br />
<br />
The best published result for a standard feedforward neural network was 160 errors. This was reduced to about 130 errors with 0.5 dropout and different L2 constraints for each hidden unit input weight. By omitting a random 20% of the input pixels in addition to the aforementioned changes, the number of errors was further reduced to 110. The following figure visualizes the result.<br />
[[File:mnist_figure.png|center|500px]]<br />
A publicly available pre-trained deep belief net resulted in 118 errors, and it was reduced to 92 errors when the model was fine-tuned with dropout. Another publicly available model was a deep Boltzmann machine, and it resulted in 103, 97, 94, 93 and 88 when the model was fine-tuned using standard backpropagation and was unrolled. They were reduced to 83, 79, 78, 78, and 77 when the model was fine-tuned with dropout – the mean of 79 errors was a record for models that do not use prior knowledge or enhanced training sets.<br />
<br />
= TIMIT = <br />
<br />
TIMIT dataset includes voice samples of 630 American English speakers varying across 8 different dialects. It is often used to evaluate the performance of automatic speech recognition systems. Using Kaldi, the dataset was pre-processed to extract input features in the form of log filter bank responses.<br />
<br />
=== Pre-training and Training ===<br />
<br />
For pretraining, they pretrained their neural network with a deep belief network and the first layer was built using Restricted Boltzmann Machine (RBM). Initializing visible biases with zero, weights were sampled from random numbers that followed normal distribution <math>N(0, 0.01)</math>. Each visible neuron’s variance was set to 1.0 and remained unchanged.<br />
<br />
Minimizing Contrastive Divergence (CD) was used to facilitate learning. Since momentum is used to speed up learning, it was initially set to 0.5 and increased linearly to 0.9 over 20 epochs. The average gradient had 0.001 of a learning rate which was then multiplied by <math>(1-momentum)</math> and L2 weight decay was set to 0.001. After setting up the hyperparameters, the model was done training after 100 epochs. Binary RBMs were used for training all subsequent layers with a learning rate of 0.01. Then, <math>p</math> was set as the mean activation of a neuron in the data set and the visible bias of each neuron was initialized to <math>log(p/(1 − p))</math>. Training each layer with 50 epochs, all remaining hyper-parameters were the same as those for the Gaussian RBM.<br />
<br />
=== Dropout tuning ===<br />
<br />
The initial weights were set in a neural network from the pretrained RBMs. To finetune the network with dropout-backpropagation, momentum was initially set to 0.5 and increased linearly up to 0.9 over 10 epochs. The model had a small constant learning rate of 1.0 and it was used to apply to the average gradient on a minibatch. The model also retained all other hyperparameters the same as the model from MNIST dropout finetuning. The model required approximately 200 epochs to converge. For comparison purpose, they also finetuned the same network with standard backpropagation with a learning rate of 0.1 with the same hyperparameters.<br />
<br />
=== Classification Test and Performance ===<br />
<br />
A Neural network was constructed to output the classification error rate on the test set of TIMIT dataset. They have built the neural network with four fully-connected hidden layers with 4000 neurons per layer. The output layer distinguishes distinct classes from 185 softmax output neurons that are merged into 39 classes. After constructing the neural network, 21 adjacent frames with an advance of 10ms per frame was given as an input.<br />
<br />
Comparing the performance of dropout with standard backpropagation on several network architectures and input representations, dropout consistently achieved lower error and cross-entropy. The results showed that dropout significantly controls overfitting, making the method robust to the choice of network architecture. It also allowed much larger nets to be trained and removed the need for early stopping. Moreover, neural network architectures with dropout are not very sensitive to the choice of learning rate and momentum.<br />
<br />
= Reuters Corpus Volume =<br />
Reuters Corpus Volume I archives 804,414 news documents belonging to 103 topics. They used a subset of 63 classes grouped under four major themes: corporate/industrial, economics, government/social, and markets. After removing 11 classes with no data and one class with insufficient data, 50 classes and 402,738 documents remained. The documents were divided equally and randomly into training and test sets, with each document represented by the 2000 most frequent words in the dataset, excluding stopwords.<br />
<br />
They trained two neural networks, with size 2000-2000-1000-50, one using dropout and backpropagation, and the other using standard backpropagation. The training hyperparameters are the same as that in MNIST, but training was done for 500 epochs.<br />
<br />
In the following figure, we see significant improvements in test set error by the model with dropout. On the right side, we see that learning with dropout also proceeds more smoothly. <br />
<br />
[[File:reuters_figure.png|700px|center]]<br />
<br />
= CNN =<br />
<br />
Feed-forward neural networks consist of several layers of neurons, where each neuron applies a linear filter to the input image data and passes the result on to the neurons in the next layer. When computing a neuron’s output, a scalar bias is added to the filter response before a nonlinear activation function is applied; the filter weights and biases are parameters of the network learned from training data. [[File:cnnbigpicture.jpeg|thumb|upright=2|center|alt=text|Figure: Overview of Convolutional Neural Network]] There are several differences between convolutional neural networks and ordinary neural networks; the figure above gives a visual overview of a CNN. First, a CNN’s neurons are organized topographically into banks laid out on a 2D grid, reflecting the dimensions of the input data. Second, neurons in a CNN apply filters that are local and centered at the neuron’s location in the topographic organization: useful clues for identifying the object in an input image can be found by examining local neighbourhoods of the image. Third, all neurons in a bank apply the same filter at different locations in the input image. In the image example, green is the input to one neuron bank, yellow is the filter bank, and pink is the output of one neuron bank (the convolved feature). A bank of neurons in a CNN applies a convolution operation (a filter) to its input; a single layer in a CNN typically has multiple banks of neurons, each performing a convolution with a different filter. The resulting neuron banks become distinct input channels into the next layer. This weight sharing reduces the net’s representational capacity, but also reduces its capacity to overfit.<br />
[[File:bankofneurons.gif|thumb|upright=3|center|alt=text|Figure: Bank of neurons]]<br />
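A minimal NumPy sketch of the filtering a single neuron bank performs (one shared filter slid over the image; "valid" cross-correlation, which is what deep-learning convolution layers typically compute):<br />

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` and return the map of local responses."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```

Because every output position reuses the same `kernel`, the number of parameters is independent of the image size, which is the weight sharing described above.<br />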
<br />
=== Pooling ===<br />
<br />
The pooling layer summarizes the activities of local patches of neurons in the convolutional layer by subsampling its output. Pooling is useful for extracting dominant features and for decreasing the computational power required to process the data through dimensionality reduction. The procedure works as follows: the output of a convolutional layer is divided into sections called pooling units, which are laid out topographically and connected to a local neighbourhood of other pooling units from the same convolutional output. Each pooling unit is then computed with some function, typically the maximum or the average. Max pooling returns the maximum value from the section of the image covered by the pooling unit, while average pooling returns the average of all the values inside the pooling unit (see example). As a result, there are fewer total pooling units than convolutional unit outputs from the previous layer, due to the larger spacing between pixels on pooling layers. Using the max-pooling function reduces the effect of outliers and improves generalization. In overlapping pooling, this spacing (usually referred to as the stride between pooling units) is smaller than the size of the neighbourhood that the pooling units summarize. With this variant, the pooling layer can produce a coarse coding of the outputs, which helps generalization. <br />
[[File:maxandavgpooling.jpeg|thumb|upright=2|center|alt=text|Figure: Max pooling and Average pooling]]<br />
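A small NumPy sketch of overlapping pooling as described above (3×3 neighbourhoods with a stride of 2; the function is generic, so `np.max` gives max pooling and `np.mean` gives average pooling):<br />

```python
import numpy as np

def pool2d(x, size=3, stride=2, op=np.max):
    """Summarize each size x size neighbourhood, moving `stride` pixels at a time.
    With stride < size the neighbourhoods overlap, as in overlapping pooling."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = op(patch)
    return out
```

Note the output grid is smaller than the input, which is the dimensionality reduction mentioned above.<br />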
<br />
=== Local Response Normalization === <br />
<br />
This network includes local response normalization layers, implemented in lateral form, which are used on neurons with unbounded activations and permit the detection of high-frequency features with a large neuron response. This regularizer encourages competition among neurons belonging to different banks. Normalization is done by dividing the activity of a neuron in bank <math>i</math> at position <math>(x,y)</math> by the equation:<br />
[[File:local response norm.png|upright=2|center|]] where the sum runs over <math>N</math> ‘adjacent’ banks of neurons at the same position in the topographic organization. The constants <math>N</math>, <math>\alpha</math> and <math>\beta</math> are hyper-parameters whose values are determined using a validation set. This technique has since been replaced by better techniques such as the combination of dropout and regularization methods (<math>L1</math> and <math>L2</math>).<br />
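The normalization can be sketched as follows; this is one common form of response normalization (the additive constant <math>k</math> and the exact neighbourhood handling at the edges are assumptions here, not taken from the paper):<br />

```python
import numpy as np

def local_response_norm(a, alpha=0.001, beta=0.75, n=9, k=1.0):
    """Divide each bank's activity by a function of the summed squared
    activities of n 'adjacent' banks at the same spatial position.
    `a` has shape (banks,) for a single (x, y) position."""
    out = np.empty_like(a, dtype=float)
    for i in range(len(a)):
        lo, hi = max(0, i - n // 2), min(len(a), i + n // 2 + 1)
        denom = (k + (alpha / n) * np.sum(a[lo:hi] ** 2)) ** beta
        out[i] = a[i] / denom
    return out
```

Banks with large-activity neighbours get divided by a larger denominator, which is the competition between banks described above.<br />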
<br />
=== Neuron nonlinearities ===<br />
<br />
All of the neurons in this model use the max-with-zero nonlinearity, where the output of a neuron is computed as <math> a^{i}_{x,y} = \max(0, z^i_{x,y})</math>, where <math> z^i_{x,y} </math> is the neuron's total input. This nonlinearity has several advantages over traditional saturating neuron models, such as a significant reduction in the training time required to reach a given error rate. Another advantage is that it reduces the need for contrast normalization and data pre-processing, since the neurons do not saturate: their activities simply scale with large input values. As this model's only pre-processing step, the mean activity is subtracted from each pixel, yielding centered data.<br />
<br />
=== Objective function ===<br />
<br />
The objective function of the network maximizes the multinomial logistic regression objective, which is equivalent to minimizing the average cross-entropy between the true label and the model's predicted label across training cases.<br />
<br />
=== Weight Initialization === <br />
<br />
It’s important to note that if a neuron always receives a negative input during training, it will not learn, because its output is uniformly zero under the max-with-zero nonlinearity. Hence, the weights in the model were sampled from a zero-mean normal distribution with a high enough variance: a sufficiently high variance ensures that some neurons receive positive values so that learning can happen. In practice, it is necessary to try several candidate variances until a working initialization is found. In their experiments, initializing the biases of the neurons in the hidden layers to a positive constant such as 1 was helpful in finding it.<br />
<br />
=== Training ===<br />
<br />
The model is trained using stochastic gradient descent with a batch size of 128 samples and a momentum of 0.9. The update rule for weight <math>w</math> is $$ v_{i+1} = 0.9v_i + \epsilon \left\langle \frac{dE}{dw}\right\rangle_i$$ $$w_{i+1} = w_i + v_{i+1} $$ where <math>i</math> is the iteration index, <math>v</math> is a momentum variable, <math>\epsilon</math> is the learning rate and <math>\left\langle \frac{dE}{dw}\right\rangle_i</math> is the average over the <math>i</math>th batch of the derivative of the objective with respect to <math>w_i</math>. The whole training process takes roughly 90 minutes on CIFAR-10; on ImageNet it takes 4 days with dropout and two days without.<br />
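The update rule above can be sketched directly (using the document's ascent form; `grad` stands for the batch-averaged derivative of the objective):<br />

```python
def momentum_step(w, v, grad, lr=0.01, momentum=0.9):
    """One momentum update: v <- 0.9 v + lr * grad, then w <- w + v."""
    v_new = momentum * v + lr * grad
    return w + v_new, v_new
```

The velocity `v` accumulates past gradients, so consistent gradient directions build up speed while oscillating ones partially cancel.<br />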
<br />
=== Learning ===<br />
To determine the learning rate for the network, one starts with an equal learning rate for each layer, chosen as the power of ten that produces the largest reduction in the objective function; it is usually of the order of <math>10^{-2}</math> or <math>10^{-3}</math>. In this case, the learning rate was reduced twice by a factor of ten before training was terminated.<br />
<br />
= CIFAR-10 =<br />
<br />
=== CIFAR-10 Dataset ===<br />
<br />
The CIFAR-10 dataset is a labeled subset of the Tiny Images dataset with 10 classes and with incorrect labels removed. It contains 5000 training images and 1000 test images per class. The dataset consists of 32 x 32 color images gathered from the web, each labeled with the noun used to search for the image.<br />
<br />
[[File:CIFAR-10.png|thumb|upright=2|center|alt=text|Figure 4: CIFAR-10 Sample Dataset]]<br />
<br />
=== Models for CIFAR-10 ===<br />
<br />
Two models, one with dropout and one without, were built to test the performance of dropout on CIFAR-10. Both models are CNNs with three convolutional layers, each followed by a pooling layer. All of the pooling layers use a stride of 2 and summarize a 3×3 neighbourhood. The pooling layer following the first convolutional layer performs max-pooling, while the remaining two pooling layers perform average-pooling. The first and second pooling layers are followed by response normalization layers with <math>N = 9, \alpha = 0.001</math>, and <math>\beta = 0.75</math>. A ten-unit softmax layer, used to output a probability distribution over class labels, is connected to the upper-most pooling layer. All convolutional layers have 64 filter banks with filters of size 5×5.<br />
<br />
Additional changes were made with the model with dropout. The model with dropout enables us to use more parameters because dropout forces a strong regularization on the network. Thus, a fourth weight layer is added to take the input from the previous pooling layer. This fourth weight layer is locally connected, but not convolutional, and contains 16 banks of filters of size 3 × 3 with 50% dropout. Lastly, the softmax layer takes its input from this fourth weight layer.<br />
<br />
Thus, the neural network with 3 convolutional hidden layers, each with a pooling layer, achieved a classification error of 16.6%, beating the best published error rate of 18.5% obtained without using transformed data. The model with one additional locally-connected layer and dropout at the last hidden layer reduced the error rate to 15.6%.<br />
<br />
= ImageNet =<br />
<br />
===ImageNet Dataset===<br />
<br />
ImageNet is a dataset of millions of labeled high-resolution images belonging to 1000 different categories. The data were collected from the web and labeled manually using Amazon's Mechanical Turk crowd-sourcing tool.<br />
Because this dataset has millions of labeled images in thousands of categories, it is very difficult to achieve perfect accuracy, even for humans, since ImageNet images may contain multiple objects and there is a large number of object classes. ImageNet and CIFAR-10 are very similar, but the scale of ImageNet is about 20 times larger (1,300,000 vs 60,000 images). ImageNet contains about 1.3 million training images, 50,000 validation images, and 150,000 testing images. Images resized to 256 x 256 pixels were used for the experiments.<br />
<br />
'''An ambiguous example to classify:'''<br />
<br />
[[File:imagenet1.png|200px|center]]<br />
<br />
When this paper was written, the best score on this dataset was an error rate of 45.7%, achieved by high-dimensional signature compression for large-scale image classification (J. Sanchez, F. Perronnin, CVPR 2011). The authors of this paper achieved a comparable error rate of 48.6% using a single neural network with five convolutional hidden layers interleaved with max-pooling layers, followed by two globally connected layers and a final 1000-way softmax layer. Applying 50% dropout to the 6th layer brought the error rate down to 42.4%.<br />
<br />
'''ImageNet Dataset:'''<br />
<br />
[[File:imagenet2.png|400px|center]]<br />
<br />
===Models for ImageNet===<br />
<br />
They mostly focused on the model with dropout because the one without dropout used a similar approach but had a serious overfitting issue. They used a convolutional neural network trained on 224×224 patches randomly extracted from the 256×256 images, which reduces the network's capacity to overfit the training data and helps generalization as a form of data augmentation. For testing, the prediction of the net was averaged over ten 224×224 patches of the 256×256 input image: the center patch, the four corner patches, and their horizontal reflections. This complicated network architecture was used to maximize performance on the validation set, and dropout was found to be very effective. It was also demonstrated that non-convolutional higher layers with a large number of parameters worked well with dropout, but had a negative impact on performance without dropout.<br />
<br />
The network contains seven weight layers. The first five are convolutional, and the last two are globally-connected. Max-pooling layers follow layers 1, 2, and 5. The output of the last globally-connected layer is fed to a 1000-way softmax output layer. Using this architecture, the authors achieved an error rate of 48.6%. Applying 50% dropout to the 6th layer brought the error rate down to 42.4%.<br />
<br />
<br />
[[File:modelh2.png|700px|center]] <br />
<br />
[[File:layer2.png|600px|center]]<br />
<br />
As with the previous datasets (MNIST, TIMIT, Reuters, and CIFAR-10), we also see a significant improvement on the ImageNet dataset. Even with complicated architectures like this one, introducing dropout helps models generalize better and gives lower test error rates.<br />
<br />
= Conclusion =<br />
<br />
The authors have shown a consistent improvement by the models trained with dropout in classifying objects in the following datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.<br />
<br />
= Critiques =<br />
It is a very brilliant idea to drop out half of the neurons to reduce co-adaptations. It is mentioned that for fully connected layers, dropout in all hidden layers works better than dropout in only one hidden layer. Another paper, Dropout: A Simple Way to Prevent Neural Networks from Overfitting [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf], gives a more detailed explanation.<br />
<br />
It will be interesting to see how this paper could be used to prevent overfitting of LSTMs.<br />
<br />
Firstly, classification by the "dropout" CNN method (omitting neurons in hidden layers) is a very interesting topic. If the authors could briefly explain the theoretical advantages of this method for processing image data, it would be easier for readers to understand. A discussion of how the method deals with the overfitting issue would also be valuable.<br />
<br />
The authors mention that they tried various dropout probabilities and that the majority of them improved the model's generalization performance, but that more extreme probabilities tended to be worse which is why a dropout rate of 50% was used in the paper. The authors further develop this point to mention that the method can be improved by adapting individual dropout probabilities of each hidden or input unit using validation tests. This would be an interesting area to further develop and explore, as using a hardcoded 50% dropout for all layers might not be the optimal choice for all CNN applications. It would have been interesting to see the results of their investigations of differing dropout rates.<br />
<br />
The authors don't explain that during training, at each layer where we apply dropout, the values must be scaled by <math>1/(1-p)</math>, where <math>p</math> is the dropout rate; this way the expected value of the layer's activations is the same at both train and test time. They may have considered another solution for this discrepancy at the time (it is an old paper), but no such solution seems to be presented here. <br />
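For illustration, a minimal NumPy sketch of this "inverted dropout" scaling (the rate and seed below are arbitrary):<br />

```python
import numpy as np

def dropout(x, rate=0.5, train=True, rng=None):
    """Zero each unit with probability `rate`; scale survivors by 1/(1-rate)
    so the expected activation matches test time (no scaling needed then)."""
    if not train or rate == 0.0:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)
```

At test time the function is the identity, because the scaling during training already preserved the expected activations.<br />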
<br />
Despite the advantages of using dropout to prevent overfitting and reducing errors in testing, the authors did not discuss much about the effects on the length of training time. In another [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf paper] published a few years later by the same authors, there was more discussion about this. It appears that dropout increases training time by 2-3 times compared to a standard NN with the same architecture, which is a drawback that might be worth mentioning. <br />
<br />
== Reference ==<br />
[1] N. Srivastava, "Dropout: a simple way to prevent neural networks from overfitting", The Journal of Machine Learning Research, Jan 2014.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Describtion_of_Text_Mining&diff=49035Describtion of Text Mining2020-12-03T22:33:17Z<p>Z42qin: </p>
<hr />
<div>== Presented by == <br />
Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid<br />
<br />
== Introduction ==<br />
This paper focuses on different text mining techniques and the applications of text mining in the healthcare and biomedical domain. The text mining field has become popular as a result of the amount of text data available in different forms; text data was expected to grow roughly 50-fold between 2010 and 2020. Text is unstructured information that is easy for humans to construct and understand but difficult for machines. Hence, there is a need to design algorithms to effectively process this avalanche of text. To explore the text mining field further, the related text mining approaches can be considered. The different text mining approaches relate to two main methods: knowledge discovery and traditional data mining methods. <br />
<br />
The authors note that knowledge discovery methods involve the application of different steps to a specific data set to create specific patterns. Research in knowledge discovery methods has evolved over the years due to advances in hardware and software technology. On the other hand, data mining has experienced substantial development through the intersection of three fields: databases, machine learning, and statistics. As the authors point out, text mining approaches focus on the extraction of information from a specific text, in structured, semi-structured, or unstructured form. It is important to note that text mining covers different sets of algorithms and topics, including information retrieval, that are used for analyzing different text forms.<br />
<br />
==Text Representation and Encoding ==<br />
The authors review multiple methods of preprocessing text, including 4 methods to preprocess text and recognize the influence and frequency of individual groups of words in a document. In many text mining algorithms, preprocessing is one of the key components. It consists of different tasks, including tokenization, filtering, lemmatization, and stemming. The first step is tokenization, where a character sequence is broken down into words or phrases. After the breakdown, filtering is carried out to remove some words. The various inflected word forms are grouped together through lemmatization, and the roots of derived words are then obtained through stemming.<br />
<br />
'''1. Tokenization'''<br />
<br />
This process splits text into units of words or phrases, known as tokens, while removing unnecessary characters. Characters such as punctuation are removed and the text is split at space characters. An example of this would be converting the string "This is my string" to "This", "is", "my", "string".<br />
<br />
'''2. Filtering'''<br />
<br />
Filtering is a process by which unnecessary words or characters are removed. Often these include punctuation, prepositions, and conjunctions. The resulting corpus then contains the words with maximal importance for distinguishing between classes.<br />
<br />
'''3. Lemmatization'''<br />
<br />
Lemmatization is a task that converts the various inflected forms of a word to a single form. However, we must specify the part of speech (POS) of each word, which can be prone to error. <br />
<br />
'''4. Stemming'''<br />
<br />
Stemming extracts the roots of words; it is a language-dependent process. The goal of stemming, like that of lemmatization, is to reduce inflectional and related (definition-wise) forms of a word to a common base form. An example of reducing words to a base form would be changing "am", "are", or "is" to "be".<br />
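The four preprocessing steps can be sketched as a toy pipeline (the stopword list and the suffix-stripping rules are illustrative only; real systems use curated stopword lists and stemmers such as Porter's):<br />

```python
import re

STOPWORDS = {"this", "is", "my", "a", "the", "of", "and"}  # illustrative

def tokenize(text):
    """Split on non-letters and lowercase the tokens."""
    return re.findall(r"[a-z]+", text.lower())

def filter_tokens(tokens):
    """Drop stopwords."""
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    """A crude suffix-stripping stemmer, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in filter_tokens(tokenize(text))]
```

Note that a suffix-stripper like this cannot map "am" to "be"; that mapping requires dictionary-based lemmatization.<br />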
<br />
'''Vector Space Model'''<br />
In this section of the paper, the authors explore the different ways in which text can be represented over a large collection of documents. One common way of representing the documents is the bag of words, which considers the occurrences of different terms.<br />
In different text mining applications, documents are ranked and represented as vectors so as to display the significance of any word. <br />
The authors note that the three basic models used are the vector space model, the inference network model, and the probabilistic model. The vector space model represents documents by converting them into vectors, where each term in a document is assigned a weight indicating its importance in that document. <br />
<br />
Two main models are used for the weights: the Boolean model and the TF-IDF model. <br />
'''Boolean model'''<br />
A term is assigned a positive weight <math>w_{ij}</math> if it appears in the document; otherwise, it is assigned a weight of 0. <br />
<br />
'''term frequency - inverse document frequency (TF-IDF)'''<br />
The words are weighted using the TF-IDF scheme computed as <br />
<br />
$$<br />
q(w)=f_d(w)*\log{\frac{|D|}{f_D(w)}}<br />
$$<br />
<br />
The frequency of each term is normalized by the inverse document frequency, which helps recognize the importance of distinctive words that occur in only a few documents. Each document is represented by a vector of term weights, <math>\omega(d) = (\omega(d, w_1), \omega(d,w_2),...,\omega(d,w_v))</math>. The similarity between two documents <math>d_1, d_2</math> is commonly measured by cosine similarity:<br />
$$<br />
S(d_1,d_2) = \cos(\theta) = \frac{d_1\cdot d_2}{\sqrt{\sum_{i=1}^vw^2_{1i}}\cdot\sqrt{\sum_{i=1}^vw^2_{2i}}}<br />
$$<br />
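A small sketch of the TF-IDF weighting <math>q(w)=f_d(w)\log(|D|/f_D(w))</math> and cosine similarity over sparse weight vectors (represented here as dictionaries; the helper names are my own):<br />

```python
import math
from collections import Counter

def tfidf(doc_tokens, corpus):
    """Weight q(w) = f_d(w) * log(|D| / f_D(w)) for each term in a document."""
    df = Counter()                      # f_D(w): number of documents containing w
    for d in corpus:
        df.update(set(d))
    tf = Counter(doc_tokens)            # f_d(w): term frequency in the document
    return {w: tf[w] * math.log(len(corpus) / df[w]) for w in tf}

def cosine(u, v):
    """Cosine similarity of two sparse weight vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

A term appearing in every document gets weight zero (log of 1), which is how TF-IDF discounts uninformative words.<br />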
<br />
<br />
<br />
== Classification ==<br />
Classification in Text Mining aims to assign predefined classes to text documents. For a set <math>\mathcal{D} = \{d_1, d_2, \cdots, d_n\}</math> of documents, each <math>d_i</math> is mapped to a label <math>l_i</math> from the set <math>\mathcal{L} = \{l_1, l_2, \cdots, l_k\}</math>. The goal is to find a classification model <math>f</math> such that: <math>\\</math><br />
$$<br />
f: \mathcal{D} \rightarrow \mathcal{L} \quad \quad \quad f(d) = l<br />
$$<br />
The author illustrates 4 different classifiers that are commonly used in text mining.<br />
<br />
<br />
'''1. Naive Bayes Classifier''' <br />
<br />
Bayes' rule is used to classify new examples by selecting the class with the highest posterior probability. <br />
The Naive Bayes classifier models the distribution of documents in each class using a probabilistic model which assumes that the distributions<br />
of different terms are independent of each other. The models commonly used in this classifier try to find the posterior probability of a class based on the term distribution; they assume that documents are generated by a mixture model parameterized by <math>\theta</math> and compute the likelihood of a document as a sum of probabilities over all mixture components. In addition, the Naive Bayes classifier can help get around the curse of dimensionality, which may arise with high-dimensional data such as text. <br />
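A minimal multinomial Naive Bayes sketch over bag-of-words documents (the Laplace smoothing is added here for robustness; it is not described in the text, and the toy data is invented):<br />

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate log-priors and smoothed per-class word log-likelihoods."""
    classes = set(labels)
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    vocab = {w for doc in docs for w in doc}
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)  # add-one smoothing
        loglik[c] = {w: math.log((counts[c][w] + 1) / total) for w in vocab}
    return prior, loglik

def predict_nb(doc, prior, loglik):
    """Pick the class with the highest posterior (unknown words are ignored)."""
    def score(c):
        return prior[c] + sum(loglik[c][w] for w in doc if w in loglik[c])
    return max(prior, key=score)
```

The term-independence assumption shows up as the plain sum of per-word log-likelihoods.<br />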
<br />
'''2. Nearest Neighbour Classifier'''<br />
<br />
The nearest neighbour classifier uses distance-based measures to perform the classification. Documents belonging to the same class are more likely to be "similar", i.e., close to each other under the similarity measure. The classification of a test document is inferred from the class labels of similar documents in the training set. k-nearest neighbour classification is well known to suffer from the "curse of dimensionality": the volume of each <math>d</math>-sphere surrounding a data point, relative to the volume of the sample space, shrinks exponentially in <math>d</math>. <br />
<br />
'''3. Decision Tree Classifier'''<br />
<br />
A decision tree classifier is a hierarchical tree of the training instances, in which a condition on an attribute value is used to divide the data hierarchically. The decision tree recursively partitions the training data set into smaller subdivisions based on a set of tests defined at each node or branch. Each node of the tree is a test of some attribute of the training instance, and each branch descending from the node corresponds to one of the values of this attribute. The conditions on the nodes are commonly defined by the terms in the text documents.<br />
<br />
'''4. Support Vector Machines'''<br />
<br />
SVMs are a form of linear classifier: models that make a classification decision based on the value of a linear combination of the document's features. The output of a linear predictor is defined as <math> y=\vec{a} \cdot \vec{x} + b</math>, where <math>\vec{x}</math> is the normalized document word-frequency vector, <math>\vec{a}</math> is a vector of coefficients, and <math>b</math> is a scalar. A Support Vector Machine attempts to find a linear separator between the various classes. An advantage of the SVM method is that it is robust to high dimensionality.<br />
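The linear decision rule can be sketched in a few lines (thresholding at zero for the two-class case is an assumption here; training the coefficients is a separate optimization problem):<br />

```python
def linear_predict(x, a, b):
    """Linear classifier: the sign of a . x + b decides the class."""
    y = sum(ai * xi for ai, xi in zip(a, x)) + b
    return 1 if y >= 0 else -1
```

The SVM training procedure chooses `a` and `b` to maximize the margin between the two classes; the prediction step itself is just this dot product.<br />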
<br />
== Clustering ==<br />
Clustering has been extensively studied in the context of text, as it has a wide range of applications such as visualization and document organization.<br />
<br />
Clustering algorithms are used to group similar documents and thus aid in information retrieval. Text clustering can operate at different levels of granularity, where clusters can be documents, paragraphs, sentences, or terms. Since text data has numerous distinct characteristics, text-specific algorithms need to be designed for the task; simply using a binary vector to represent a text document is not enough. Here are some unique properties of text representation:<br />
<br />
1. Text representation has a large dimensionality: the size of the vocabulary from which the documents are drawn is massive, but a document might contain only a small number of those words.<br />
<br />
2. The words in documents are usually correlated with each other, and algorithms need to take this correlation into account.<br />
<br />
3. The number of words differs from one document to another, so documents need to be normalized before the clustering process.<br />
<br />
Three most commonly used text clustering algorithms are presented below.<br />
<br />
<br />
'''1. Hierarchical Clustering algorithms''' <br />
<br />
Hierarchical clustering algorithms build a group of clusters that can be depicted as a hierarchy of clusters. The hierarchy can be constructed top-down (divisive) or bottom-up (agglomerative). Hierarchical clustering algorithms are distance-based clustering algorithms, i.e., they use a similarity function to measure the closeness between text documents.<br />
<br />
In the top-down approach, the algorithm begins with one cluster that includes all the documents and recursively splits this cluster into sub-clusters.<br />
Here is an example of a hierarchical clustering algorithm in which the data are clustered by Euclidean distance. This (bottom-up) method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a}, {b}, {c}, {d}, {e} and {f}. The first step determines which elements to merge into a cluster by taking the two closest elements, according to the chosen distance.<br />
<br />
<br />
[[File:418px-Hierarchical clustering simple diagram.svg.png| 300px | center]]<br />
<br />
<br />
<div align="center">Figure 1: Hierarchical Clustering Raw Data</div><br />
<br />
<br />
<br />
[[File:250px-Clusters.svg (1).png| 200px | center]]<br />
<br />
<br />
<div align="center">Figure 2: Hierarchical Clustering Clustered Data</div><br />
<br />
A main advantage of hierarchical clustering is that the algorithm only needs to be run once for any number of clusters (i.e., if an individual wishes to use a different number of clusters than originally intended, they do not need to repeat the algorithm).<br />
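The bottom-up (agglomerative) merging illustrated above can be sketched on toy one-dimensional data (the coordinates for {a}...{f} are invented, and single linkage is assumed):<br />

```python
def agglomerate(points):
    """Bottom-up single-linkage clustering: repeatedly merge the two
    closest clusters until one remains; return the merge history."""
    clusters = {name: [p] for name, p in points.items()}
    history = []
    while len(clusters) > 1:
        names = list(clusters)
        best = None
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                d = min(abs(p - q) for p in clusters[names[i]]
                                   for q in clusters[names[j]])
                if best is None or d < best[0]:
                    best = (d, names[i], names[j])
        d, a, b = best
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        history.append((a, b, d))
    return history
```

The returned merge history is exactly the dendrogram of the figures above, read bottom-up.<br />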
<br />
'''2. k-means Clustering'''<br />
<br />
k-means clustering is a partitioning algorithm that, in the context of text data, partitions n documents into k clusters.<br />
<br />
Input: Document D, similarity measure S, number k of cluster<br />
Output: Set of k clusters<br />
Select randomly ''k'' datapoints as starting centroids<br />
While ''not converged'' do <br />
Assign documents to the centroids based on the closest similarity<br />
Calculate the cluster centroids for all clusters<br />
return ''k clusters''<br />
<br />
The main disadvantage of k-means clustering is that it is very sensitive to the initial choice of centroids and of the number k. Also, since the algorithm runs until the clusters converge, k-means clustering tends to take longer to perform than hierarchical clustering. On the other hand, the advantages of k-means clustering are that it is simple to implement, the algorithm scales well to large datasets, and the results are easily interpretable.<br />
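The pseudocode above can be turned into a runnable sketch on dense vectors (Euclidean distance is assumed as the similarity measure; the seed is arbitrary):<br />

```python
import random

def kmeans(docs, k, iters=100, seed=0):
    """Lloyd's algorithm: assign each document to its nearest centroid,
    recompute centroids, and repeat until assignments stop changing."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(docs, k)]  # random starting centroids
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda j: sum(
            (x - c) ** 2 for x, c in zip(d, centroids[j]))) for d in docs]
        if new_assign == assign:
            break  # converged
        assign = new_assign
        for j in range(k):
            members = [d for d, a in zip(docs, assign) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign, centroids
```

Different seeds can converge to different partitions, which is the sensitivity to initialization noted above.<br />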
<br />
<br />
'''3. Probabilistic Clustering and Topic Models'''<br />
<br />
Topic modeling is one of the most popular probabilistic clustering algorithms in recent studies. The main idea is to create a ''probabilistic generative model'' for the corpus of text documents. In topic models, documents are mixtures of topics, where each topic represents a probability distribution over words.<br />
<br />
There are two main topic models:<br />
* Probabilistic Latent Semantic Analysis (pLSA)<br />
* Latent Dirichlet Allocation (LDA)<br />
<br />
The paper covers LDA in more detail. LDA is a state-of-the-art unsupervised algorithm for extracting topics from a collection of documents.<br />
<br />
Given <math>\mathcal{D} = \{d_1, d_2, \cdots, d_{|\mathcal{D}|}\}</math> is the corpus and <math>\mathcal{V} = \{w_1, w_2, \cdots, w_{|\mathcal{V}|}\}</math> is the vocabulary of the corpus. <br />
<br />
A topic <math>z_j</math>, <math>1 \leq j \leq K</math>, is a multinomial probability distribution over the <math>|\mathcal{V}|</math> words. <br />
<br />
The distribution of words in a given document is:<br />
<br />
<math>p(w_i|d) = \Sigma_{j=1}^K p(w_i|z_j)p(z_j|d)</math><br />
<br />
The LDA assumes the following generative process for the corpus of <math>\mathcal{D}</math><br />
* For each topic <math>k\in \{1,2,\cdots, K\}</math>, sample a word distribution <math>\phi_k \sim Dir(\beta)</math><br />
* For each document <math>d \in \{1,2,\cdots,D\}</math><br />
** Sample a topic distribution <math>\theta_d \sim Dir(\alpha)</math><br />
** For each word <math>w_n, n \in \{1,2,\cdots,N\}</math> in document <math>d</math><br />
*** Sample a topic <math>z_n \sim Mult(\theta_d)</math><br />
*** Sample a word <math>w_n \sim Mult(\phi_{z_n})</math><br />
<br />
In practice, LDA is often used as a module in more complicated models and has already been applied to a wide variety of domains. In addition, many variations of LDA have been created, including supervised LDA (sLDA) and hierarchical LDA (hLDA).<br />
<br />
== Information Extraction ==<br />
Information Extraction (IE) is the process of automatically extracting useful, structured information from unstructured or semi-structured text. <br />
<br />
For example, from the sentence “XYZ company was founded by Peter in the year of 1950”, we can identify the following information:<br />
<br />
Founderof(Peter, XYZ)<br />
Foundedin(1950, XYZ)<br />
<br />
The author mentions four components that are important for Information Extraction:<br />
<br />
'''1. Named Entity Recognition (NER)'''<br />
<br />
This is the process of identifying real-world entities in free text, such as "Apple Inc.", "Donald Trump", "PlayStation 5", etc. Moreover, the task is to identify the category of each of these entities: for example, "Apple Inc." falls in the category of companies, "Donald Trump" in the category of USA presidents, and "PlayStation 5" in the category of entertainment systems. <br />
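A toy dictionary-based sketch of NER (the simplest family of approaches; real systems typically use statistical models, and the entity dictionary below is invented for illustration):<br />

```python
# Hypothetical entity dictionary mapping surface forms to categories.
ENTITY_DICT = {
    "Apple Inc.": "ORGANIZATION",
    "Donald Trump": "PERSON",
    "PlayStation 5": "PRODUCT",
}

def tag_entities(text):
    """Return (entity, category) pairs found in the text by exact dictionary lookup."""
    return [(name, cat) for name, cat in ENTITY_DICT.items() if name in text]

tag_entities("Donald Trump commented on the PlayStation 5 launch.")
```

Dictionary lookup alone cannot cover a growing vocabulary, which is exactly why the statistical models below are used.<br />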
<br />
'''2. Hidden Markov Model'''<br />
<br />
Since traditional probabilistic classification does not consider the predicted labels of neighboring words, the Hidden Markov Model is used for Information Extraction. This model is different because it assumes that the label of one word depends on the labels of the words that precede it. <br />
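This dependence on previous labels is usually decoded with the Viterbi algorithm. Below is a minimal sketch on a two-state toy HMM; all states and probabilities are made up for illustration:<br />

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state (label) sequence for an observation sequence."""
    # V[t][s] = (best probability of ending in state s at step t, the path achieving it)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-2][prev][1] + [s])
                for prev in states)
            V[-1][s] = (prob, path)
    return max(V[-1].values())[1]

# Toy model: is each word part of an entity or not?
states = ("ENTITY", "OTHER")
start_p = {"ENTITY": 0.5, "OTHER": 0.5}
trans_p = {"ENTITY": {"ENTITY": 0.6, "OTHER": 0.4},
           "OTHER": {"ENTITY": 0.2, "OTHER": 0.8}}
emit_p = {"ENTITY": {"peter": 0.6, "founded": 0.05, "xyz": 0.35},
          "OTHER": {"peter": 0.05, "founded": 0.9, "xyz": 0.05}}
path = viterbi(["peter", "founded", "xyz"], states, start_p, trans_p, emit_p)
```

Note how the label chosen for "xyz" depends on the transition from the label of "founded", not only on "xyz" itself.<br />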
<br />
'''3. Conditional Random Fields'''<br />
<br />
This is a technique that is widely used in Information Extraction. The definition of it is related to graph theory. <br />
Let G = (V, E) be a graph and let <math>Y = (Y_v)_{v \in V}</math> be a set of random variables indexed by the vertices of G. Then (X, Y) is a conditional random field when the random variables <math>Y_v</math>, conditioned on X, obey the Markov property with respect to the graph:<br />
<math>p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)</math>, where <math>w \sim v</math> means w and v are neighbors in G.<br />
<br />
'''4. Relation Extraction'''<br />
<br />
This is the task of finding semantic relationships between entities in text documents, for example in a sentence such as "Seth Curry is the brother of Stephen Curry". Given a document including these two names, the task is to identify the relationship between the two entities.<br />
<br />
== Biomedical Application ==<br />
<br />
Text mining has several applications in the domain of biomedical sciences. The explosion of academic literature in the field has made it quite hard for scientists to keep up with novel research. This is why text mining techniques are ever so important in making the knowledge digestible.<br />
<br />
The text mining techniques are able to extract meaningful information from large data by making use of biomedical ontology, which is a compilation of a common set of terms used in an area of knowledge. The Unified Medical Language System (UMLS) is the most comprehensive such resource, consisting of definitions of biomedical jargon. Several information extraction algorithms rely on the ontology to perform tasks such as Named Entity Recognition (NER) and Relation Extraction.<br />
<br />
NER involves locating and classifying biomedical entities into meaningful categories and assigning semantic representation to those entities. The NER methods can be broadly grouped into Dictionary-based, Rule-based and Statistical approaches. NER tasks are challenging in the biomedical domain due to three key reasons: (1) There is a continuously growing volume of semantically related entities in the biomedical domain due to continuous scientific progress, so NER systems depend on dictionaries of terms which can never be complete; (2) There are often numerous names for the same concept in the biomedical domain, such as "heart attack" and "myocardial infarction"; and (3) Acronyms and abbreviations are frequently used which makes it complicated to identify the concepts these terms express.<br />
<br />
Relation extraction, on the other hand, is the process of determining relationships between the entities. This is accomplished mainly by identifying the correlation between entities through analyzing the frequency of terms, as well as rules defined by domain experts. Moreover, modern algorithms are also able to summarize large documents and answer natural language questions posed by humans.<br />
<br />
Summarization is a common biomedical text mining task which largely utilizes information extraction tasks. The idea is to automatically identify significant aspects of documents and represent them in a coherent fashion. However, evaluating summarization methods is very difficult since deciding whether a summary is "good" is often subjective, although there are some automatic evaluation techniques for summaries such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which compares automatically generated summaries with those created by humans.<br />
<br />
== Conclusion ==<br />
<br />
This paper gave a holistic overview of the methods and applications of text mining, particularly its relevance in the biomedical domain. It highlights several popular algorithms and summarizes them along with their advantages, limitations and some potential situations where they could be used. Because of ever-growing data, for example, the very high volume of scientific literature being produced every year, the interest in this field is massive and is bound to grow in the future.<br />
<br />
== Critiques==<br />
<br />
This is a very detailed approach to introducing different algorithms for text mining. Since many algorithms are given, it might be a good idea to compare their performance on text mining by training them on some text data and comparing them to former baselines, to see if there is any improvement.<br />
<br />
It is a detailed summary of the techniques used in text mining. It would be more helpful if some datasets were included for training and testing. The algorithms were grouped by different topics, so different datasets and measurements would be required.<br />
<br />
It would be better for the paper to include test accuracies on training and test sets to support the claim that text mining is a more efficient and effective approach than other techniques. Moreover, this paper mentioned that the text mining approach can be used to extract high-quality information from videos. Extracting information from videos is presumably much more difficult than from images and text; how is it possible to retain good test accuracy for videos?<br />
<br />
Preprocessing is an important step in analyzing text, so it would be better to have more details about it. For example, what types of words are usually removed, and should we record the relative position of each word in the sentence? If closely related sentences are split into two sentences, how can we capture their relation?<br />
<br />
The authors could give more details on the applications of text mining in the healthcare and biomedical domain, for example, how the preprocessing, classification, clustering, and information extraction processes could be applied to this domain. Other than introducing existing algorithms (e.g. NER), the authors could provide more information about how they perform (with a sample dataset), what their limitations are, and comparisons among different algorithms.<br />
<br />
In the preprocessing section, it seems like the authors incorrectly describe what stemming is - stemming just removes the last few letters of a word (ex. studying -> study, studies -> studi). What the authors actually describe is lemmatization, which is much more informative than stemming. The downside of lemmatization is that it takes more effort to build a lemmatizer than a stemmer, and even once it is built, it is slow in comparison with a stemmer.<br />
<br />
One of the challenges of text mining in the biomedical field is that a lot of patient data are still in the form of paper documents. Text mining can speed up the digitization of patient data and allow for the development of disease diagnosis algorithms. It'll be interesting to see how text mining can be integrated with healthcare AI such as the doppelganger algorithm to enhance question answering accuracy. (Cresswell et al, 2018)<br />
<br />
== References ==<br />
<br />
Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering, and extraction techniques. arXiv preprint arXiv:1707.02919.<br />
<br />
Cresswell, Kathrin & Cunningham-Burley, Sarah & Sheikh, Aziz. (2018). Healthcare robotics – a qualitative exploration of key challenges and future directions (Preprint). Journal of Medical Internet Research. 20. 10.2196/10410.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN&diff=49033Mask RCNN2020-12-03T22:06:54Z<p>Z42qin: /* Model Architecture */</p>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision which is important when attempting to identify different objects within the same image. <br />
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection.<br />
<br />
== Visual Perception Tasks == <br />
<br />
Figure 1 shows a visual representation of different types of visual perception tasks:<br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Build on image classification but localize each object in an image by placing bounding boxes around the objects<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object. Instance segmentation combines image classification, object detection and semantic segmentation making it a complex task [1].<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture for Instance Segmentation.<br />
<br />
== Related Work == <br />
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.<br />
<br />
ROI Pooling: The main use of ROI (Region of Interest) Pooling is to adjust proposals to a uniform size, which is easier for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divides the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section.<br />
<br />
Faster R-CNN: Faster R-CNN consists of two stages: Region Proposal Network and ROI Pooling. Region Proposal Network proposes candidate object bounding boxes. ROI Pooling, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale. Other than FPN, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of Mask R-CNN is quite similar to that of Faster R-CNN. <br />
Faster R-CNN has two stages: the RPN (Region Proposal Network) first proposes candidate object bounding boxes, and then RoIPool extracts features from these boxes, which are used for classification and bounding-box regression. Mask R-CNN shares the identical first stage, but the second stage is adjusted: instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI, giving the multi-task loss <math>L=L_{cls}+L_{box}+L_{mask}</math>, where <math>L_{cls}</math>, <math>L_{box}</math>, and <math>L_{mask}</math> represent the classification loss, the bounding-box loss, and the average binary cross-entropy mask loss, respectively.<br />
<br />
The important concept here is that, in most earlier network systems, there is a fixed order to follow when performing classification and regression, because classification depends on the mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. For comparison, the complete original R-CNN pipeline involves: 1. making region proposals; 2. extracting features from the region proposals; 3. classifying objects with an SVM; 4. bounding-box regression. Stages 3 and 4 are the ones adjusted to simplify the network's procedure.<br />
<br />
The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.<br />
One thing worth noticing is that for other network systems, those masks across classes compete with each other, but in this particular case, with a <br />
per-pixel sigmoid and a binary loss the masks across classes no longer compete, which makes this formula the key for good instance segmentation results.<br />
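A small numpy sketch of this loss: the mask term is a per-pixel sigmoid with average binary cross-entropy, so masks for different classes do not compete; the logits and the placeholder values for <math>L_{cls}</math> and <math>L_{box}</math> below are invented for illustration.<br />

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_loss(logits, target):
    """logits, target: (m, m) arrays; target is a binary ground-truth mask."""
    p = sigmoid(logits)                 # per-pixel sigmoid, no softmax over classes
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p))
    return bce.mean()                   # L_mask: average binary cross-entropy

# Total multi-task loss L = L_cls + L_box + L_mask; L_cls and L_box stand in
# for values produced by the classification and box-regression heads.
rng = np.random.default_rng(0)
target = (rng.random((4, 4)) > 0.5).astype(float)
logits = np.where(target == 1, 3.0, -3.0)   # confident, correctly-aligned logits
L_cls, L_box = 0.2, 0.1                     # placeholder head losses
L = L_cls + L_box + mask_loss(logits, target)
```

Because each class's mask is scored independently, predicting a good mask for the k-th class is unaffected by the logits of every other class.<br />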
<br />
'' RoIAlign''<br />
<br />
This concept is useful in stage 2 where the RoIPool extracts features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to the pixel-to-pixel correspondence. <br />
<br />
The two properties we desire along the procedure are pixel-to-pixel correspondence, and that no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size; if there is a size difference, there will be information loss and coordinates cannot be matched. Instead of quantizing, RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin and aggregates the results (using max or average pooling). These results are robust to the exact sampling locations and the number of points, and spatial correspondence is guaranteed.<br />
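A minimal sketch of the bilinear interpolation RoIAlign relies on: the feature value at a real-valued sampling point is computed exactly from the four surrounding feature-map cells, with no coordinate quantization (the 2×2 feature map below is illustrative):<br />

```python
import numpy as np

def bilinear(feature, x, y):
    """Interpolate a 2-D feature map at real-valued coordinates (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))   # top-left neighbouring cell
    x1, y1 = x0 + 1, y0 + 1                       # bottom-right neighbouring cell
    dx, dy = x - x0, y - y0                       # fractional offsets
    return (feature[y0, x0] * (1 - dx) * (1 - dy) +
            feature[y0, x1] * dx * (1 - dy) +
            feature[y1, x0] * (1 - dx) * dy +
            feature[y1, x1] * dx * dy)

fmap = np.array([[0.0, 1.0],
                 [2.0, 3.0]])
bilinear(fmap, 0.5, 0.5)   # centre of the four cells: average of 0, 1, 2, 3
```

RoIPool would instead round (x, y) to the nearest cell, which is the misalignment RoIAlign removes.<br />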
<br />
The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
Some implementation details should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5, and negative otherwise. This is important because the mask loss <math>L_{mask}</math> is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. As an example of the complete pipeline, 1000 proposals are generated for FPN and the box prediction branch is run on these proposals. The mask branch is then applied to the 100 highest-scoring detection boxes. The mask branch can predict K masks per RoI, but only the kth mask is used, where k is the class predicted by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
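The positive/negative RoI rule can be sketched directly from the IoU definition; the boxes below are hypothetical (x1, y1, x2, y2) tuples:<br />

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_positive(roi, gt_box, threshold=0.5):
    """An RoI is positive (and contributes to L_mask) iff IoU >= 0.5."""
    return iou(roi, gt_box) >= threshold

gt = (0, 0, 10, 10)
is_positive((0, 0, 10, 10), gt)   # identical box: IoU 1.0, positive
is_positive((9, 9, 19, 19), gt)   # barely overlapping: negative
```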
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: On the COCO dataset, Mask R-CNN outperforms MNC and FCIS, the previous state-of-the-art models, in all categories. <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Human Pose Estimation ==<br />
Mask RCNN can be extended to human pose estimation.<br />
<br />
The simple approach the paper presents is to model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types such as left shoulder, right elbow. <br />
<br />
[[File:HumanPose.png | center]]<br />
<div align="center">Figure 11: Keypoint Detection Results</div><br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster R-CNN, the RoI boundary is quantized. However, Mask R-CNN avoids quantization and uses bilinear interpolation to compute exact feature values. By solving the misalignment due to quantization, the number and location of sampling points have little impact on the result.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks comparisons of different methods and Mask R-CNN on unlabeled data, as the paper only briefly mentions that the authors found that Mask R-CNN can benefit from extra data, even if the data is unlabelled.<br />
<br />
The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.<br />
<br />
The Mask RCNN could be a candidate model to do short-term predictions on the physical behaviors of a person, which could be very useful at crime scenes.<br />
<br />
An interesting application of Mask R-CNN would be face recognition from CCTV footage. Blurry pictures of crowded people could be obtained from CCTV, and Mask R-CNN could then be applied to distinguish each person.<br />
<br />
The main problem for CNN architectures like Mask RCNN is the running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live stream detections, where a faster running time would mean a better response to changes in frames. It would be beneficial to have a graphical representation of the Mask RCNN running times against single shot detector algorithms such as YOLOv3.<br />
<br />
It would be interesting to investigate combining instance segmentation with semantic segmentation to improve time performance, because in many situations knowing the exact boundary of an object is not necessary.<br />
<br />
It would be better to have more comparisons with other models, along with more details about why Mask R-CNN performs better and how efficient it is.<br />
<br />
The authors mention that Mask R-CNN is a deep neural network architecture for instance segmentation. It would be better to include more background information about this task, for example its challenges (e.g. the model needs to take into account the overlapping of objects) and the limitations of existing methods.<br />
<br />
It would be interesting to see how a postprocessing step with conditional random fields (CRF) might improve (or not?) segmentation. It would also have been interesting to see the performance of the method with lighter backbones since the backbones used to have a very large inference time which makes them unsuitable for many applications.<br />
<br />
An extension of the application of Mask RCNN in medical AI is to highlight areas of an MRI scan that correlate to certain behavioral/psychological patterns.<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Research_Papers_Classification_System&diff=49029Research Papers Classification System2020-12-03T21:19:53Z<p>Z42qin: /* Paper Classification Using K-means Clustering */</p>
<hr />
<div>= Presented by =<br />
Jill Wang, Junyi (Jay) Yang, Yu Min (Chris) Wu, Chun Kit (Calvin) Li<br />
<br />
= Introduction =<br />
This paper introduces a paper classification system that utilizes the Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and K-means clustering. The most important technology the system used to process big data is the Hadoop Distributed File System (HDFS). The system can handle quantitatively complex research paper classification problems efficiently and accurately.<br />
<br />
===General Framework===<br />
<br />
The paper classification system classifies research papers based on the abstracts given that the core of most papers is presented in the abstracts. <br />
<br />
<ol><li>Paper Crawling <br />
<p>Collects abstracts from research papers published during a given period</p></li><br />
<li>Preprocessing<br />
<p> <ol style="list-style-type:lower-alpha"><li>Removes stop words in the papers crawled, in which only nouns are extracted from the papers</li><br />
<li>Generates a keyword dictionary, keeping only the top-N keywords with the highest frequencies</li> </ol><br />
</p></li> <br />
<li>Topic Modelling<br />
<p> Use the LDA to group the keywords into topics</p><br />
</li><br />
<li>Paper Length Calculation<br />
<p> Calculates the total number of word occurrences, using the map-reduce algorithm, to prevent unbalanced TF values caused by the varying lengths of abstracts</p></li><br />
</li><br />
<li>Word Frequency Calculation<br />
<p> Calculates the Term Frequency (TF) values which represent the frequency of keywords in a research paper</p><br />
</li><br />
<li>Document Frequency Calculation<br />
<p> Calculates the Document Frequency (DF) values which represents the frequency of keywords in a collection of research papers. The higher the DF value, the lower the importance of a keyword.</p><br />
</li><br />
<li>TF-IDF calculation<br />
<p> Calculates the inverse of the DF which represents the importance of a keyword.</p><br />
</li><br />
<li>Paper Classification<br />
<p> Classify papers by topics using the K-means clustering algorithm.</p><br />
</li><br />
</ol><br />
<br />
===Technologies===<br />
<br />
A Hadoop cluster composed of one master node, one sub-node, and four data nodes runs the HDFS used to process the massive paper data. The TF-IDF calculation is performed with Hadoop 2.6.5 in Java, the LDA with Spark MLlib, and the K-means clustering with the Scikit-learn library.<br />
<br />
===HDFS===<br />
<br />
Hadoop Distributed File System was used to process the big data in this system. Hadoop breaks a large collection of data into partitions and passes each partition to an individual processor; each processor then only has information about the partition of data it received.<br />
<br />
'''In this summary, we are going to focus on introducing the main algorithms of what this system uses, namely LDA, TF-IDF, and K-Means.'''<br />
<br />
=Data Preprocessing=<br />
===Crawling of Abstract Data===<br />
<br />
Under the assumption that audiences tend to first read the abstract of a paper to gain an overall understanding of the material, it is reasonable to assume the abstract section includes “core words” that can be used to effectively classify a paper's subject.<br />
<br />
An abstract is crawled to have its stop words removed. Stop words are words that are usually ignored by search engines, such as “the”, “a”, and etc. Afterward, nouns are extracted, as a more condensed representation for efficient analysis.<br />
<br />
This is managed on HDFS. The TF-IDF value of each paper is calculated through map-reduce.<br />
<br />
===Managing Paper Data===<br />
<br />
To construct an effective keyword dictionary using the abstract data and keywords of all the crawled papers, the authors grouped keywords with similar meanings under a single representative keyword. This approach, called stemming, is common in data cleaning. 1394 keyword categories were extracted, which is still too many to compute, so only the top 30 keyword categories are used.<br />
<br />
<div align="center">[[File:table_1_kswf.JPG|700px]]</div><br />
<br />
=Topic Modeling Using LDA=<br />
<br />
Latent Dirichlet allocation (LDA) is a generative probabilistic model that views documents as random mixtures over latent topics. Each topic is a distribution over words, and the goal is to extract these topics from documents.<br />
<br />
LDA estimates the topic-word distribution <math>P\left(t | z\right)</math> (the probability of term "t" under topic "z") and the document-topic distribution <math>P\left(z | d\right)</math> (the probability of topic "z" in document "d") using Dirichlet priors for the distributions, with a fixed number of topics. For each document, obtain a feature vector:<br />
<br />
\[F = \left( P\left(z_1 | d\right), P\left(z_2 | d\right), \cdots, P\left(z_k | d\right) \right)\]<br />
<br />
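A hedged sketch of producing this feature vector and feeding it to K-means (the paper runs LDA in Spark MLlib and only the clustering in Scikit-learn; here both use scikit-learn for brevity, and the toy abstracts are invented):<br />

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

abstracts = [
    "network learning model training data",
    "learning network data model layers",
    "protein cell gene expression tissue",
    "gene cell protein dna tissue",
]
X = CountVectorizer().fit_transform(abstracts)

# Row d of F is the feature vector (P(z_1|d), ..., P(z_k|d)) from above.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
F = lda.fit_transform(X)

# Cluster the papers in topic space, as in the paper's final classification step.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(F)
```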
In the paper, authors extract topics from preprocessed paper to generate three kinds of topic sets, each with 10, 20, and 30 topics respectively. The following is a table of the 10 topic sets of highest frequency keywords.<br />
<br />
<div align="center">[[File:table_2_tswtebls.JPG|700px]]</div><br />
<br />
<br />
===LDA Intuition===<br />
<br />
LDA uses the Dirichlet priors of the Dirichlet distribution, which allows the algorithm to model a probability distribution ''over prior probability distributions of words and topics''. The following picture illustrates 2-simplex Dirichlet distributions with different alpha values, one for each corner of the triangles. <br />
<br />
<div align="center">[[File:dirichlet_dist.png|700px]]</div><br />
<br />
A simplex is a generalization of the notion of a triangle. In the Dirichlet distribution, each parameter is represented by a corner of the simplex, so adding parameters increases the dimension of the simplex. As illustrated, when the alphas are smaller than 1, the distribution is dense at the corners; when the alphas are greater than 1, the distribution is dense at the center.<br />
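This corner-versus-center behaviour can be checked numerically by sampling (the alpha values and sample count below are arbitrary choices for the sketch):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# alpha < 1: samples concentrate near the simplex corners (sparse mixtures).
sparse = rng.dirichlet([0.1, 0.1, 0.1], size=2000)
# alpha > 1: samples concentrate near the centre (smooth mixtures).
smooth = rng.dirichlet([10.0, 10.0, 10.0], size=2000)

# With small alpha, the largest coordinate of a sample is typically close to 1;
# with large alpha it stays close to 1/3.
print(sparse.max(axis=1).mean(), smooth.max(axis=1).mean())
```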
<br />
The following illustration shows an example LDA with 3 topics, 4 words and 7 documents.<br />
<br />
<div align="center">[[File:LDA_example.png|800px]]</div><br />
<br />
In the left diagram, there are three topics, hence it is a 2-simplex. In the right diagram there are four words, hence it is a 3-simplex. LDA essentially adjusts parameters in Dirichlet distributions and multinomial distributions (represented by the points), such that, in the left diagram, all the yellow points representing documents and, in the right diagram, all the points representing topics, are as close to a corner as possible. In other words, LDA finds topics for documents and also finds words for topics. At the end topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> are produced.<br />
<br />
=Term Frequency Inverse Document Frequency (TF-IDF) Calculation=<br />
<br />
TF-IDF is widely used to evaluate the importance of a set of words in the fields of information retrieval and text mining. It is a combination of term frequency (TF) and inverse document frequency (IDF). The idea behind this combination is<br />
* It evaluates the importance of a word within a document, and<br />
* It evaluates the importance of the word among the collection of all documents<br />
<br />
The TF-IDF formula has the following form:<br />
<br />
\[TF-IDF_{i,j} = TF_{i,j} \times IDF_{i}\]<br />
<br />
where i stands for the <math>i^{th}</math> word and j stands for the <math>j^{th}</math> document.<br />
<br />
===Term Frequency (TF)===<br />
<br />
TF measures how frequently a given word occurs in a document. The higher the TF value, the more important the word is to that document.<br />
<br />
In this paper, we only calculate TF for words in the keyword dictionary obtained. For a given keyword i, <math>TF_{i,j}</math> is the number of times word i appears in document j divided by the total number of words in document j.<br />
<br />
The formula for TF has the following form:<br />
<br />
\[TF_{i,j} = \frac{n_{i,j} }{\sum_k n_{k,j} }\]<br />
<br />
where <math>i</math> stands for the <math>i^{th}</math> word, <math>j</math> stands for the <math>j^{th}</math> document, <math>n_{i,j}</math> stands for the number of times word <math>t_i</math> appears in document <math>d_j</math>, and <math>\sum_k n_{k,j} </math> stands for the total number of word occurrences in document <math>d_j</math>.<br />
<br />
Note that the denominator is the total number of words remaining in document j after crawling.<br />
<br />
===Document Frequency (DF)===<br />
<br />
DF evaluates the percentage of documents that contain a given word over the entire collection of documents. Thus, the higher the DF value, the less important the word is.<br />
<br />
<math>DF_{i}</math> is the number of documents in the collection with word i divided by the total number of documents in the collection. The formula for DF has the following form:<br />
<br />
\[DF_{i} = \frac{|\{d_k \in D: n_{i,k} > 0\}|}{|D|}\]<br />
<br />
where <math>n_{i,k}</math> is the number of times word i appears in document k, |D| is the total number of documents in the collection.<br />
<br />
Since DF and the importance of the word have an inverse relation, we use inverse document frequency (IDF) instead of DF.<br />
<br />
===Inverse Document Frequency (IDF)===<br />
<br />
In this paper, IDF is calculated on a log scale, since the system will process a large number of documents, i.e., |D| will be large.<br />
<br />
The formula for IDF has the following form:<br />
<br />
\[IDF_{i} = \log\left(\frac{|D|}{|\{d_k \in D: n_{i,k} > 0\}|}\right)\]<br />
<br />
As mentioned before, we will use HDFS. Adding 1 to both the numerator and the denominator avoids a division by zero when a keyword appears in no document of a partition. The actual formula applied is:<br />
<br />
\[IDF_{i} = \log\left(\frac{|D|+1}{|\{d_k \in D: n_{i,k} > 0\}|+1}\right)\]<br />
<br />
The inverse document frequency gives a measure of how rare a certain term is in a given document corpus.<br />
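The TF and smoothed IDF formulas above can be combined into a short plain-Python sketch (an illustrative toy example; the documents and keywords are invented, and this is not the paper's distributed Hadoop implementation):<br />

```python
import math
from collections import Counter

def tf_idf(docs, keywords):
    """Score each keyword in each document with TF * smoothed IDF.

    docs is a list of token lists; keywords is the keyword dictionary.
    IDF uses the smoothed form log((|D| + 1) / (DF + 1)) from the text.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each keyword appear?
    df = {w: sum(1 for d in docs if w in d) for w in keywords}
    scores = []
    for d in docs:
        counts, total = Counter(d), len(d)
        scores.append({w: (counts[w] / total)
                          * math.log((n_docs + 1) / (df[w] + 1))
                       for w in keywords})
    return scores

docs = [["cloud", "computing", "cloud"],
        ["grid", "computing"],
        ["cloud", "grid"]]
for row in tf_idf(docs, ["cloud", "grid"]):
    print(row)
```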
<br />
=Paper Classification Using K-means Clustering=<br />
<br />
The K-means clustering is an unsupervised classification algorithm that groups similar data into the same class. It is an efficient and simple method that can work with different types of data attributes and is able to handle noise and outliers.<br />
<br><br />
<br />
Given a <math>d \times n</math> dataset <math>\mathbf{X} = \left[ \mathbf{x}_1 \cdots \mathbf{x}_n \right]</math>, the algorithm assigns each <math>\mathbf{x}_j</math> to one of <math>k</math> clusters based on the characteristics of <math>\mathbf{x}_j</math> itself.<br />
<br><br />
<br />
Moreover, when assigning data to a cluster, the algorithm also tries to minimize the distance between each data point and the centre of the cluster to which the point belongs. That is, k-means clustering minimizes the sum of squared errors:<br />
<br />
\begin{align*}<br />
\min \sum_{i=1}^{k} \sum_{j \in C_i} ||x_j - \mu_i||^2<br />
\end{align*}<br />
<br />
where<br />
<ul><br />
<li><math>k</math>: the number of clusters</li><br />
<li><math>C_i</math>: the <math>i^{th}</math> cluster</li><br />
<li><math>x_j</math>: the <math>j^{th}</math> data point in <math>C_i</math></li><br />
<li><math>\mu_i</math>: the centroid of <math>C_i</math></li><br />
<li><math>||x_j - \mu_i||^2</math>: the Euclidean distance between <math>x_j</math> and <math>\mu_i</math></li><br />
</ul><br />
<br><br />
<br />
The K-means clustering algorithm, an unsupervised algorithm, is chosen for its advantages: it can deal with different types of attributes, runs with minimal domain knowledge, copes with noise and outliers, and finds clusters of similar items. <br />
<br />
<br />
Since the goal for this paper is to classify research papers and group papers with similar topics based on keywords, the paper uses the K-means clustering algorithm. The algorithm first computes the cluster centroid for each group of papers with a specific topic. Then, it will assign a paper into a cluster based on the Euclidean distance between the cluster centroid and the paper’s TF-IDF value.<br />
<br><br />
<br />
However, different values of <math>k</math> (the number of clusters) will return different clustering results. Therefore, it is important to define the number of clusters before clustering. For example, in this paper, the authors choose to use the Elbow scheme to determine the value of <math>k</math>. The Elbow scheme is a somewhat subjective way of choosing an optimal <math>k</math> that involves plotting the average of the squared distances from the cluster centers of the respective clusters (distortion) as a function of <math>k</math> and choosing a <math>k</math> at which point the decrease in distortion is outweighed by the increase in complexity. Also, to measure the performance of clustering, the authors decide to use the Silhouette scheme. The results of clustering are validated if the Silhouette scheme returns a value greater than <math>0.5</math>.<br />
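The clustering objective and the Elbow scheme can be illustrated with a minimal NumPy sketch of Lloyd's algorithm (a toy example on made-up blob data, not the paper's TF-IDF vectors):<br />

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns centroids, labels, and the distortion
    (sum of squared Euclidean distances to the assigned centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    distortion = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, distortion

# Toy stand-ins for document vectors: three well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(30, 4)) for c in (0.0, 1.0, 2.0)])

# Elbow scheme: distortion falls sharply until k reaches the true number
# of clusters, then flattens out.
for k in (1, 2, 3, 4):
    print(k, round(kmeans(X, k)[2], 3))
```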
<br />
=System Testing Results=<br />
<br />
In this paper, the dataset has 3264 research papers from the Future Generation Computer System (FGCS) journal between 1984 and 2017. For constructing keyword dictionaries for each paper, the authors have introduced three methods as shown below:<br />
<br />
<div align="center">[[File:table_3_tmtckd.JPG|700px]]</div><br />
<br />
<br />
Then, the authors use the Elbow scheme to define the number of clusters for each method with different numbers of keywords before running the K-means clustering algorithm. The results are shown below:<br />
<br />
<div align="center">[[File:table_4_nocobes.JPG|700px]]</div><br />
<br />
According to Table 4, there is a positive correlation between the number of keywords and the number of clusters. In addition, method 3 combines the advantages of both method 1 and method 2; thus, method 3 requires the fewest clusters in total. On the other hand, wrong keywords might be present in papers; hence, method 1 might not group papers with similar subjects correctly, and so it needs the largest number of clusters in total.<br />
<br />
<br />
Next, the Silhouette scheme was used to measure the performance of the clustering. The average Silhouette values for each method with different numbers of keywords are shown below:<br />
<br />
<div align="center">[[File:table_5_asv.JPG|700px]]</div><br />
<br />
Since the clustering is validated if the Silhouette value is greater than 0.5, the K-means clustering algorithm produces good results for the methods with 10 and 30 keywords.<br />
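The Silhouette value used for this validation can be sketched in NumPy (a hypothetical toy example; 0.5 is the paper's validation threshold):<br />

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient (b - a) / max(a, b), where a is a point's
    mean distance to its own cluster and b is its smallest mean distance to
    any other cluster. Assumes every cluster has at least two points."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.empty(n)
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Two well-separated toy clusters: the score should be close to 1,
# comfortably above the 0.5 validation threshold.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(3.0, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(silhouette(X, labels))
```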
<br />
<br />
To evaluate the accuracy of the classification system in this paper, the authors use the F-Score. They run the experiment 5 times, using 500 randomly selected research papers in each trial. The following histogram shows the average F-Score for the three methods and different numbers of keywords:<br />
<br />
<div align="center">[[File:fig_16_fsvotm.JPG|700px]]</div><br />
<br />
Note that “TFIDF” means method 1, “LDA” means method 2, and “TFIDF-LDA” means method 3. The number 10, 20, and 30 after each method is the number of keywords the method has used.<br />
According to the histogram above, method 3 achieves higher F-Score values than the other two methods across different numbers of keywords. Therefore, the classification system is most accurate when using method 3, as it combines the advantages of both method 1 and method 2.<br />
<br />
=Conclusion=<br />
<br />
This paper introduces a classification system that classifies research papers into different topics by using the TF-IDF and LDA schemes with the K-means clustering algorithm. The experimental results showed that the proposed system can group papers with similar subjects according to the keywords extracted from the abstracts of the papers. This system allows users to find the papers they want quickly and productively.<br />
<br />
Furthermore, this classification system might also be applied to other types of text (e.g. documents, tweets, etc.) instead of only research papers.<br />
<br />
=Critique=<br />
<br />
In this paper, DF values are calculated within each partition. As a result, the DF value for a given word varies from partition to partition, and different partitioning methods may give inconsistent results. As mentioned above, there might be a divide-by-zero problem, since some partitions contain no documents with a given word; the authors solved this by introducing a dummy document. Another method that might better address both the inconsistency and the divide-by-zero problem is to have all partitions exchange their DF values, merge them, and pass the merged DF value back to all partitions for the final IDF and TF-IDF computation. Sharing DF values across partitions guarantees a consistent DF value everywhere and avoids division by zero, because every word in the keyword dictionary must appear in some document in the whole collection.<br />
<br />
This paper treats the words in different parts of a document equivalently; it might perform better if the same word were weighted differently depending on where it appears. For example, if a word appears in the title of a document, it is usually a main topic of that document, so more weight could be put on it for categorization.<br />
<br />
When discussing the potential advantages of this classification system for other types of text samples, has the effect of processing mixed samples (text and image, or text and video) been taken into consideration? If not, in terms of text classification only, does it have an overwhelming advantage over traditional classification models?<br />
<br />
The preprocessing should also include <math>n</math>-gram tokenization for topic modelling, because some topics are inherently two words, such as "machine learning", whose words imply different topics when seen separately.<br />
<br />
This system is very compute-intensive due to the large dictionaries that can be generated by processing large volumes of data. It would be nice to see how much data HDFS had to process, and similarly how much time was saved by using Hadoop for data processing as opposed to a centralized approach.<br />
<br />
This system could be improved further in terms of computation time by utilizing another big data framework, MapReduce, which can also use HDFS, to parallelize the K-means clustering computation across multiple nodes as discussed in (Jin, et al.) [5].<br />
<br />
It's not exactly clear what method 3 (TFIDF-LDA) is doing, how is it performing TF-IDF on the topics? Also it seems like the preprocessing step only keeps 10/20/30 top words? This seems like an extremely low number especially in comparison with the LDA which has 10/20/30 topics - what is the reason for so strongly limiting the number of words? It would also be interesting to see if both key words and topics are necessary - an ablation study showing the significance of both would be interesting.<br />
<br />
It would be better if the paper included an example showing some topics on actual research papers. It would also help to visualize the distance between each research paper and the topic names.<br />
<br />
=References=<br />
<br />
Blei DM, et al. (2003). Latent Dirichlet allocation. J Mach Learn Res 3:993–1022<br />
<br />
Gil, JM, Kim, SW. (2019). Research paper classification systems based on TF-IDF and LDA schemes. ''Human-centric Computing and Information Sciences'', 9, 30. https://doi.org/10.1186/s13673-019-0192-7<br />
<br />
Liu, S. (2019, January 11). Dirichlet distribution Motivating LDA. Retrieved November 2020, from https://towardsdatascience.com/dirichlet-distribution-a82ab942a879<br />
<br />
Serrano, L. (Director). (2020, March 18). Latent Dirichlet Allocation (Part 1 of 2) [Video file]. Retrieved 2020, from https://www.youtube.com/watch?v=T05t-SqKArY<br />
<br />
Jin, Cui, Yu. (2016). A New Parallelization Method for K-means. https://arxiv.org/ftp/arxiv/papers/1608/1608.06347.pdf</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Speech2Face:_Learning_the_Face_Behind_a_Voice&diff=49026Speech2Face: Learning the Face Behind a Voice2020-12-03T21:01:42Z<p>Z42qin: /* Model Architecture */</p>
<hr />
<div>== Presented by == <br />
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang<br />
<br />
== Introduction ==<br />
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/YouTube videos of people speaking to learn the correlation between a voice and the corresponding face. The model learns these correlations, allowing it to produce facial reconstructions that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The authors evaluate the model and numerically quantify how closely the reconstructions produced by the Speech2Face model resemble the true face images of the respective speakers.<br />
<br />
== Previous Work ==<br />
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of settings, including cross-modal retrieval, sound source localization, and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been used in binary and multi-task classification tasks, with some promising results. Studies have been able to identify active speakers in a video, separate speech from multiple concurrent sources, predict lip motion from speech, and even learn the emotion of the agents based on their voices. Aytar et al. [6] proposed a student-teacher training procedure in which a well-established visual recognition model was used to transfer the knowledge obtained in the visual modality to the sound modality, using unlabeled videos.<br />
<br />
Recently, various methods have been suggested for using audio signals to reconstruct visual information, where the reconstructed subject is constrained by prior assumptions. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. A generative adversarial network (GAN) is a model that uses a generator to produce seemingly plausible data for training and a discriminator that identifies whether the training data is fabricated by the generator or real [7]. This paper instead aims to recover the dominant and generic facial structure from speech.<br />
<br />
== Motivation ==<br />
It seems to be a common trait among humans to imagine what people look like when we hear their voices, before we have ever seen them. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciation, etc. These properties of speech are often shared across nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method reconstructs an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.<br />
<br />
== Model Architecture == <br />
<br />
'''Speech2Face model and training pipeline'''<br />
<br />
[[File:ModelFramework.jpg|center]]<br />
<br />
<div style="text-align:center;"> Figure 1. '''Speech2Face model and training pipeline''' </div><br />
<br />
<br />
<br />
The Speech2Face model used to achieve the desired result consists of two parts: a voice encoder, which takes a spectrogram of speech as input and outputs low-dimensional face features, and a face decoder, which takes face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face: the features produced by the voice encoder are passed to the face decoder to form an image. The variability in facial expressions, head positions, and lighting conditions of the face images creates a challenge for both the design and training of the Speech2Face model. To avoid this problem, the model is trained to first regress to a low-dimensional intermediate representation of the face. <br />
<br />
'''Face Decoder''' <br />
The face decoder was taken from the previous work of Cole et al. [3] and will not be explored in great detail here. The VGG-Face model, a face recognition model pretrained on a large-scale face database [5], is used to extract a 4096-D face feature from the penultimate layer of the network. In essence, this facial feature is passed through a single multilayer perceptron layer, whose output is fed to a convolutional neural network that determines the texture of the image and to a multilayer perceptron that determines the landmark locations. The face decoder kept the VGG-Face model's dimensions and weights. The weights were trained separately and remained fixed during the voice encoder training. <br />
<br />
'''Voice Encoder Architecture''' <br />
<br />
[[File:VoiceEncoderArch.JPG|center]]<br />
<br />
<div style="text-align:center;"> Table 1: '''Voice encoder architecture''' </div><br />
<br />
<br />
<br />
The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time and allows the model to be used for input speeches of varying lengths. Two fully connected layers at the end are used to return a 4096-dimensional facial feature output.<br />
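The role of that final temporal average pooling can be sketched in NumPy (illustrative shapes only; the channel and frequency sizes here are made up, not the paper's exact dimensions):<br />

```python
import numpy as np

def temporal_average_pool(feature_map):
    """Average over the last (temporal) axis of a (channels, freq, time)
    feature map, yielding a fixed-size output for any input duration."""
    return feature_map.mean(axis=-1)

rng = np.random.default_rng(0)
short = rng.random((64, 9, 30))   # features of a shorter utterance
long_ = rng.random((64, 9, 60))   # a longer one: twice the time steps

# Both collapse to the same (channels, freq) shape, which is why the
# encoder can accept input speech of varying lengths.
print(temporal_average_pool(short).shape)
print(temporal_average_pool(long_).shape)
```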
<br />
'''Training'''<br />
<br />
The AVSpeech dataset, a large-scale audio-visual dataset, is used for training. The AVSpeech dataset is comprised of millions of video segments from YouTube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair color, may be predicted by the model. From each video, a 224x224-pixel image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training set and a test set of 1.7 and 0.15 million entries, respectively, were constructed.<br />
<br />
The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and then inputted to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096-dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.<br />
<br />
In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in practice it leads to unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the VGG-Face model <math>f_{VGG}</math>: <math> \mathbb{R}^{4096} \to \mathbb{R}^{2622}</math> and the first layer of the face decoder <math>f_{dec}</math> : <math> \mathbb{R}^{4096} \to \mathbb{R}^{1000}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1\left|\left|\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}\right|\right|^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$<br />
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}(a)\text{log}p_{(i)}(b)$$ $$p_{(i)}(a) = \frac{\text{exp}(a_i/T)}{\sum_j \text{exp}(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.<br />
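The knowledge-distillation term can be sketched as a standalone NumPy function on toy vectors (a hypothetical illustration; the authors apply it to VGG-Face activations, not to 3-dimensional vectors):<br />

```python
import numpy as np

def softmax_T(a, T=2.0):
    """Temperature-scaled softmax p_(i)(a); subtracting the max is a
    standard numerical-stability trick that does not change the result."""
    z = np.exp((a - a.max()) / T)
    return z / z.sum()

def distill_loss(a, b, T=2.0):
    """L_distill(a, b) = -sum_i p_(i)(a) * log p_(i)(b)."""
    pa, pb = softmax_T(a, T), softmax_T(b, T)
    return -(pa * np.log(pb)).sum()

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0])
# Cross-entropy is minimized when the two distributions match, so the
# loss of a against itself is lower than against a mismatched vector.
print(distill_loss(a, a), distill_loss(a, b))
```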
<br />
<center><br />
[[File:L1vsTotalLoss.png | 700px]]<br />
</center><br />
<br />
<div style="text-align:center;"> Figure 2: '''Qualitative results on the AVSpeech test set''' </div><br />
<br />
== Results ==<br />
<br />
'''Confusion Matrix and Dataset statistics'''<br />
<br />
<center><br />
[[File:Confusionmatrix.png| 600px]]<br />
</center><br />
<br />
<div style="text-align:center;"> Figure 3. '''Facial attribute evaluation''' </div><br />
<br />
<br />
<br />
In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face. <br />
<br />
'''Feature Similarity'''<br />
<br />
<center><br />
[[File:FeatSim.JPG]]<br />
</center><br />
<br />
<div style="text-align:center;"> Table 2. '''Feature similarity''' </div><br />
<br />
<br />
<br />
Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and the true facial feature vector from the face decoder were computed, and presented, above, in Table 2. A comparison of facial similarity was also done based on the length of audio input. From the table, it is evident that the 6-second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth. <br />
<br />
'''S2F -> Face retrieval performance'''<br />
<br />
<center><br />
[[File: Retrieval.JPG]]<br />
</center><br />
<br />
<div style="text-align:center;"> Table 3. '''S2F -> Face retrieval performance''' </div><br />
<br />
<br />
<br />
The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, measures the probability that the K closest images to the model output includes the correct image of the speaker's face. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.<br />
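The R@K metric can be sketched in NumPy (a toy example with made-up feature vectors rather than the paper's 4096-D features):<br />

```python
import numpy as np

def recall_at_k(pred_feats, true_feats, k):
    """R@K: the fraction of queries whose own true feature vector is among
    the k nearest (Euclidean) candidates to the predicted feature vector."""
    hits = 0
    for i, q in enumerate(pred_feats):
        dists = np.linalg.norm(true_feats - q, axis=1)
        hits += i in np.argsort(dists)[:k]
    return hits / len(pred_feats)

# Toy check: predictions are slightly noisy copies of the true features,
# so retrieval should succeed far more often than random chance (1/50).
rng = np.random.default_rng(0)
true = rng.normal(size=(50, 8))
pred = true + rng.normal(scale=0.05, size=true.shape)
print(recall_at_k(pred, true, k=1))
print(recall_at_k(pred, true, k=5))
```

By construction, increasing k can only keep or improve the recall, which matches the paper's observation that R@K grows with K.<br />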
<br />
'''Additional Observations''' <br />
<br />
Ablation studies were carried out to test the effect of audio duration and batch normalization. It was found that the duration of input audio during the training stage had little effect on convergence speed (comparing 3 and 6-second speech segments), while in the test stage longer input speech yields improvement in reconstruction quality. With respect to batch normalization (BN), it was found that without BN reconstructed faces would converge to an average face, while the inclusion of BN led to results which contained much richer facial features.<br />
<br />
== Conclusion ==<br />
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.<br />
<br />
== Discussion and Critiques ==<br />
<br />
There is evidence that the results of the model may be heavily influenced by external factors:<br />
<br />
1. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the model's prediction of ethnicity towards white. The bias in the results shows that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Figure (11) highlights this shortcoming: The same man heard speaking in either English or Chinese was predicted to have a "white" appearance or an "asian" appearance respectively.<br />
<br />
2. The model was shown to infer different face features based on language. This puts into question how heavily the model depends on the spoken language. The paper mentions that the quality of face reconstruction may be affected by uncommon languages, English being the most popular language on YouTube (the training set). Testing a more controlled sample, where all speech recordings are in the same language, may help address this concern and determine the model's reliance on spoken language.<br />
<br />
3. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender, and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face and may result in a compounding effect on the miss-classification rate.<br />
<br />
4. Figure 4.b shows the AVSpeech dataset statistics. However, it doesn't show statistics about speakers' ethnicity or the language of the videos. If we train the model with a more comprehensive dataset that includes enough Asian/Indian English speakers and native-language speakers, will this increase the accuracy?<br />
<br />
5. One concern about the source of the training data, i.e. the Youtube videos, is that resolution varies a lot since the videos are randomly selected. That may be the reason why the proposed model performs badly on some certain features. For example, it is hard to tell the age when the resolution is bad because the wrinkles on the face are neglected.<br />
<br />
6. The topic of this project is very interesting, but I highly doubt this model will be practical for real-world problems, because many factors affect a person's recorded sound in a real-world environment: phone alarms, TVs, car horns, and so on. Such sounds will decrease the accuracy of the model's predictions.<br />
<br />
7. A lot of information can be obtained from someone's voice, this can potentially be useful for detective work and crime scene investigation. In our world of increasing surveillance, public voice recording is quite common and we can reconstruct images of potential suspects based on their voice. In order for this to be achieved, the model has to be thoroughly trained and tested to avoid false positives as it could have a highly destructive outcome for a falsely convicted suspect.<br />
<br />
8. This is a very interesting topic, and this summary has a good structure for readers. Since this model uses YouTube videos for training, one problem is that most YouTubers are adults, and many additional factors make this dataset highly unbalanced. Moreover, some people have baby-like voices, which could also affect the performance of the model. Overall, this is a meaningful topic; it might help police locate suspects, so applying it to police work might be interesting.<br />
<br />
9. In addition, it seems very unlikely that any results coming from this model would ever be held in regard even remotely close to being admissible in court to identify a person of interest until the results are improved and the model can be shown to work in real-world applications. Otherwise, there seems to be very little use for such technology and it could have negative impacts on people if they were to be depicted in an unflattering way by the model based on their voice.<br />
<br />
10. Using voice as a factor for constructing the face is a good idea, but the data seem to carry a lot of noise and bias. The voice in a video might not come from the person shown in it. Many YouTubers adjust their voices before uploading, and it is really hard to know whether they have done so. Also, most YouTubers are adults, so the model does not have enough training samples for teenagers and kids.<br />
<br />
11. It would be interesting to see how the performance changes with different face encoding sizes (instead of just 4096-D) and also different face models (encoders/decoders) to see if better performance can be achieved. Also, given that the dataset used was unbalanced, was the dataset used to train the face model the same dataset, or was a different one used (the model was pretrained)? This could affect the performance of the model as well.<br />
<br />
12. The audio input is transformed into a spectrogram before being used for training. They use STFT with a Hann window of 25 ms, a hop length of 10 ms, and 512 FFT frequency bands. They cite this method from a paper that focuses on speech separation, not speech classification. So, it would be interesting to see if there is a better way to do the STFT, possibly with different hyperparameters (e.g. different windowing, a different number of bands), or if another type of transform (e.g. the wavelet transform) would have better results.<br />
<br />
13. An easy way to get somewhat balanced data is to duplicate samples from the underrepresented classes.<br />
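That duplication idea is random oversampling; a minimal sketch (the class labels and counts are made up for illustration):

```python
import random

def oversample(samples, labels):
    """Duplicate minority-class samples at random until every class
    matches the size of the largest class."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(v) for v in by_class.values())
    balanced_x, balanced_y = [], []
    for y, xs in by_class.items():
        extra = [random.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            balanced_x.append(x)
            balanced_y.append(y)
    return balanced_x, balanced_y

# toy example: 4 "adult" clips vs. 1 "child" clip
x, y = oversample(["a1", "a2", "a3", "a4", "c1"],
                  ["adult"] * 4 + ["child"])
print(y.count("adult"), y.count("child"))  # 4 4
```

Note that plain duplication balances class frequencies but adds no new information, so it can encourage overfitting to the duplicated samples.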
<br />
14. This problem is interesting but hard to generalize. The algorithm does not account for other genders or mixed-race individuals. In addition, the face-recognition software Face++ introduces bias that carries forward into the Speech2Face algorithm: face-recognition algorithms are known to have higher error rates when classifying darker-skinned individuals. Thus, it will be tough to apply it to real-life scenarios such as identifying suspects.<br />
<br />
== References ==<br />
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision (ICCV), 2017.<br />
<br />
[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.<br />
<br />
[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.<br />
<br />
[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.<br />
<br />
[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.<br />
<br />
[7] “Overview of GAN Structure | Generative Adversarial Networks,” ''Google Developers'', 24-May-2019. [Online]. Available: https://developers.google.com/machine-learning/gan/gan_structure. [Accessed: 02-Dec-2020].</div>
<hr />
<div>'''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage the complex vehicle motions found at multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to improved prediction ability.<br />
<br />
== Previous Work ==<br />
Autonomous vehicle trajectory approaches previously used motion models like Constant Velocity (CV) and Constant Acceleration (CA). These models are linear and can only handle straight motions. Curvilinear models, such as Constant Turn Rate and Velocity (CTRV) and Constant Turn Rate and Acceleration (CTRA), handle rotations and more complex motions. Together with these models, a Kalman filter is used to predict the vehicle trajectory. Kalman filtering is a common technique in sensor fusion for state estimation that allows the vehicle's state to be predicted while taking into account the uncertainty associated with inputs and measurements. However, the Kalman filter performs poorly on multi-step prediction problems, where a recurrent neural network performs significantly better.<br />
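The limitation of such linear models can be illustrated with a minimal constant-velocity predictor; the numbers below are illustrative, not from the paper:

```python
import numpy as np

def predict_cv(state, dt=0.1, steps=15):
    """Multi-step prediction with a constant-velocity (CV) motion model.
    state = [x, y, vx, vy]; each step applies the linear dynamics x' = F x.
    Because F is linear, the model can only extrapolate straight motion."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    trajectory = []
    x = np.asarray(state, dtype=float)
    for _ in range(steps):
        x = F @ x
        trajectory.append(x.copy())
    return np.array(trajectory)

# vehicle at the origin moving 10 m/s in x: after 15 steps of 100 ms it is 15 m ahead
traj = predict_cv([0.0, 0.0, 10.0, 0.0])
print(traj[-1])  # [15.  0. 10.  0.]
```

A turning vehicle violates the constant-velocity assumption immediately, which is why curvilinear models and learned predictors are needed at intersections.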
<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, motion prediction models fall into 3 categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, considering only the kinematic states of the predicted vehicles. Their advantage is the minimal computational burden among the three types; however, they cannot account for interactions between vehicles. Maneuver-based models consider and classify the driver’s intention: by predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is stated to improve prediction accuracy. However, such models still serve mainly as an aid to improve physics-based models.<br />
<br />
A Recurrent Neural Network (RNN) is the type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of detected vehicles simultaneously as a scene, but such prediction algorithms are computationally more complex and are therefore often used only in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Research results indicate that little work has been dedicated to predicting vehicle trajectories at intersections. Moreover, public data sets for analyzing driver behaviour at intersections are scarce, and such data sets are not easy to collect. A model is therefore needed to predict the various movements of targets around a multi-lane turn intersection, and it is essential to design a motion predictor that can run in real-time traffic.<br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The figure below depicts the architecture of the surrounding-target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the states of surrounding vehicles, which relies on six sensing devices. The output predicts the states of the surrounding vehicles and is used to determine the desired longitudinal acceleration in actual traffic at the intersection.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Sensor Outputs ===<br />
<br />
The inputs of the target perception come from the outputs of the sensors. The data collection in this article uses 6 different devices with feature fusion to detect traffic in a range of up to 100 m: 1) LiDAR system outputs: relative position, heading, velocity, and box size in local coordinates; 2) Around-View Monitoring (AVM) and 3) GPS outputs: lanes, road markers, and global position; 4) gateway engine outputs: precise global position in the urban road environment; 5) a Micro-Autobox II and 6) an MDPS, which are used to control and actuate the subject vehicle. All data are stored on an industrial PC.<br />
<br />
=== Data ===<br />
Multi-lane turn intersections are the target roads in this paper. The real dataset was captured on urban roads in Seoul. The training model is generated from 484 tracks collected while driving through intersections in real traffic, from which the previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training samples and 4,998 evaluation samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work uses only information collected from sensors on the autonomous vehicle, as shown in the figure below. The contribution of the network architecture in this study is that the future state of the target vehicle is used as an input feature for the prediction. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
An RNN is an artificial neural network suitable for sequential data. It can be used for time-series data, where the pattern of the data depends on the flow of time, and it can contain feedback loops that allow activations to flow around the loop.<br />
An LSTM avoids the problem of vanishing gradients by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from exploding or decaying over time, which would otherwise make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, consisting of combinations of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
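As an illustration of the gating that preserves error flow, a single LSTM cell step can be sketched as follows. The weights here are random placeholders, and the 5-feature, 8-unit, 15-step sizes are arbitrary; the paper's actual network is the multi-layer architecture shown above:

```python
import numpy as np

def lstm_step(x, h, c, W, b):
    """One LSTM cell step. The additive update of the cell state c is what
    lets gradients flow backward through many time steps without vanishing."""
    z = W @ np.concatenate([x, h]) + b          # all four gates at once
    n = len(h)
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    i, f, o = sig(z[:n]), sig(z[n:2*n]), sig(z[2*n:3*n])
    g = np.tanh(z[3*n:])                        # candidate cell update
    c_new = f * c + i * g                       # additive memory update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hidden = 5, 8                           # e.g. 5 input features per step
W = rng.normal(size=(4 * n_hidden, n_in + n_hidden)) * 0.1
b = np.zeros(4 * n_hidden)
h = c = np.zeros(n_hidden)
for x in rng.normal(size=(15, n_in)):           # a 15-step input sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)  # (8,)
```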
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading, and speed.<br />
<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit. <br />
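In other words, the encoder/decoder pair amounts to z-score normalization and its exact inverse with shared parameters; a minimal sketch:

```python
import numpy as np

class Encoder:
    """Rescale each feature to zero mean and unit standard deviation;
    the decoder reuses the same parameters to map outputs back."""
    def fit(self, data):                 # data: (n_samples, n_features)
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)
        return self

    def encode(self, data):
        return (data - self.mean) / self.std

    def decode(self, data):              # inverse transform, same parameters
        return data * self.std + self.mean

raw = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
enc = Encoder().fit(raw)
z = enc.encode(raw)
print(z.mean(axis=0), z.std(axis=0))     # ~[0. 0.] [1. 1.]
print(enc.decode(z))                      # recovers the original units
```

Sharing the parameters between encode and decode is what guarantees the network's normalized outputs map back to the actual physical units.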
==== Sequence length ==== <br />
The sequence length of the RNN input and output is another important factor for improving prediction performance. In this study, sequence lengths of 5, 10, 15, 20, 25, and 30 steps at a 100-millisecond sampling time were compared, and 15 steps showed relatively accurate results even though its observation time is short compared with the longer candidates.<br />
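Assembling such fixed-length input/output sequence pairs from a recorded track can be sketched as a sliding window; the 5-feature count is an arbitrary placeholder for the features listed above:

```python
import numpy as np

def make_sequences(track, seq_len=15):
    """Cut one vehicle track (T, n_features) into input/target sequence pairs:
    each input is seq_len consecutive states, the target is the next seq_len."""
    inputs, targets = [], []
    for start in range(len(track) - 2 * seq_len + 1):
        inputs.append(track[start:start + seq_len])
        targets.append(track[start + seq_len:start + 2 * seq_len])
    return np.array(inputs), np.array(targets)

track = np.random.rand(100, 5)   # 10 s of driving at 100 ms, 5 features per step
X, Y = make_sequences(track)
print(X.shape, Y.shape)          # (71, 15, 5) (71, 15, 5)
```

With a 100 ms sampling time, 15 steps correspond to observing (and predicting) 1.5 seconds of motion.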
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers predict possible risks based on observations of surrounding vehicles and ensure safety by changing their behavior before the risks occur. To achieve a human-like motion plan, a prediction-based motion planner for autonomous vehicles is designed based on the model predictive control (MPC) method, taking into account drivers' future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref}(k|t)</math> are the states and references of the MPC problem, respectively; <math>x(k|t)</math> is composed of the travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref}(k|t)</math> consists of the reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, i.e., the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta \mu}</math> are the weight matrices for the states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles. <br />
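The cost function above could be evaluated numerically as follows; the dimensions and weights below are illustrative placeholders, not the paper's tuned values:

```python
import numpy as np

def mpc_cost(x, x_ref, u, Q, R, R_du):
    """Evaluate the MPC cost: state-tracking error over the horizon plus
    penalties on the control input and on its rate of change (slew)."""
    tracking = sum((xi - ri) @ Q @ (xi - ri) for xi, ri in zip(x, x_ref))
    effort = R * np.sum(u ** 2)
    slew = R_du * np.sum(np.diff(u) ** 2)
    return tracking + effort + slew

Np = 15                                   # prediction horizon
x = np.zeros((Np, 2))                     # states: [travel distance, velocity]
x_ref = np.tile([1.0, 10.0], (Np, 1))     # constant reference over the horizon
u = np.full(Np, 0.5)                      # constant acceleration command
Q = np.diag([1.0, 0.1])
J = mpc_cost(x, x_ref, u, Q, R=0.1, R_du=1.0)
print(J)  # 165.375
```

In the real controller this cost is minimized over `u` subject to the input and state constraints given below, not merely evaluated.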
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&u_{min} \leq u(k|t) \leq u_{max} \\<br />
&||u(k+1|t) - u(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
where <math>u_{min}</math>, <math>u_{max}</math>, and <math>S</math> are the minimum control input, maximum control input, and maximum slew rate of the input, respectively.<br />
<br />
The position and speed boundaries are determined based on the predicted states:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,tar}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
where <math>p_{x,tar}(k|t)</math> and <math>v_{x,tar}(k|t)</math> are the predicted position and velocity of the target vehicle, <math>c_{des}(k|t)</math> is the desired clearance, and <math>v_{x,limit}</math> is the speed limit of the road.<br />
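These bounds can be sketched per prediction step as follows (the clearance, target state, and speed-limit values are illustrative):

```python
def motion_bounds(p_tar, v_tar, c_des, v_limit):
    """Position/velocity bounds at one prediction step: stay c_des behind the
    predicted target position, and below both the target speed and the limit."""
    p_max = p_tar - c_des
    p_min = 0.0
    v_max = min(v_tar, v_limit)
    v_min = 0.0
    return (p_min, p_max), (v_min, v_max)

# predicted target 40 m ahead at 8 m/s, desired clearance 10 m, 60 km/h limit
bounds = motion_bounds(40.0, 8.0, 10.0, 60 / 3.6)
print(bounds)  # ((0.0, 30.0), (0.0, 8.0))
```

Because the bounds are built from the *predicted* target states, the planner slows down before a conflict occurs, which is the human-like behavior the paper aims for.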
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms: a path-following model with constant velocity, a path-following model with traffic flow, and a CTRV model.<br />
<br />
We compare these algorithms according to four types of errors: the <math>x</math> position error <math>e_{x,T_p}</math>, <math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math>, where <math>T_p</math> denotes the prediction time. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center><br />
<br />
The proposed model shows significantly smaller prediction errors compared to the base algorithms in terms of mean, standard deviation (STD), and root mean square error (RMSE). Meanwhile, the proposed model exhibits a bell-shaped error curve with a mean close to zero, which indicates that the proposed algorithm's predictions of human drivers' intentions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within reasonable levels. For instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, the proposed algorithm is precise while maintaining safety.<br />
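The three summary statistics used in this comparison (mean, STD, and RMSE of the errors) can be sketched as follows; the numbers are toy values, not the paper's results:

```python
import numpy as np

def error_stats(actual, predicted):
    """Summarize prediction errors e = actual - predicted with the three
    statistics reported in the accuracy analysis: mean, STD, and RMSE."""
    e = np.asarray(actual) - np.asarray(predicted)
    return {"mean": e.mean(),
            "std": e.std(),
            "rmse": np.sqrt(np.mean(e ** 2))}

stats = error_stats([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(stats)  # mean ~0, std and rmse ~0.158
```

A near-zero mean with small STD is exactly the bell-shaped, unbiased error distribution described above; RMSE additionally penalizes large individual errors.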
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method better mimics a human driver by simulating a human driver's decision-making process. In a multi-lane left-turn scenario, the proposed algorithm correctly predicted the trajectory of a target vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data are analyzed from two perspectives: the time to recognize the in-lane target, and the similarity to human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base algorithm does. The proposed algorithm only recognized targets later than the base algorithm when the surrounding target vehicles first appeared beyond the boundaries of the sensors' region of interest. These cases took place sufficiently beyond the safety distance and thus had little influence on determining the behaviour of the subject vehicle.<br />
<br />
<center>[[Image:Figure11_YanYu.png|500px|]]</center><br />
<br />
In order to compare the similarity between the results from the proposed algorithm and human driving decisions, we introduce another type of error, the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math> and <math>a_{x, cmd}</math> are the human driver's acceleration history and the command from the proposed algorithm, respectively. The proposed algorithm produced results more similar to human drivers' decisions than the base algorithms did: <math>91.97\%</math> of the acceleration errors lie within <math>\pm 1\, m/s^2</math>. Moreover, the base algorithm possesses only a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed model is efficient and safe.<br />
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained using data captured on urban roads in Seoul. The evaluation results showed precise prediction accuracy, so the algorithm is safe to apply to an autonomous vehicle, and the comparison with the three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm. In addition, the time to recognize in-lane targets within the intersection improved significantly over the base algorithms. The proposed algorithm was also compared with human driving data and showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unstructured environments, such as multi-lane turn intersections.<br />
<br />
== Future works ==<br />
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2. Applying the machine-learning-based approach to infer lane-change intention on motorways and the main roads of urban environments.<br />
<br />
3. Extending the target roads of the trajectory predictor, such as to roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive in real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient; it should focus more on LSTMs, RNNs, and studies on different types of roads. Why the LSTM-RNN is used, and the background of the method, are not stated clearly. Some concepts are missing, making it difficult to distinguish between the LSTM-RNN-based motion predictor and the motion planner.<br />
<br />
This is an interesting topic to discuss, and a major one for famous vehicle companies such as Tesla, which already offers a service called Autopilot that performs self-driving and motion prediction. This summary could include more architecture diagrams to give readers an overall view of what the model looks like; since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be great. It would also be interesting to discuss more applications of this method, such as airplanes or boats.<br />
<br />
Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare the performance against other algorithms, or against some traditional motion-planning algorithms like the Kalman filter (KF).<br />
<br />
Some papers have discussed the accuracy of different models for vehicle prediction, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions [https://arxiv.org/pdf/1908.00219.pdf], where the LSTM did not show good performance. They increased accuracy by combining the LSTM with an unconstrained model (UM), adding an additional LSTM layer of size 128 that recursively outputs positions instead of outputting positions for all horizons simultaneously.<br />
<br />
It would be better to provide experimental results that support the efficiency of the LSTM-RNN, discuss the predictions on the training and test sets, and compare it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
It would be really helpful if some visualization or a data summary were provided to aid understanding, such as the tracks of the vehicles' movements.<br />
<br />
The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.<br />
<br />
Understandably, a supervised learning problem should be evaluated on some test dataset. However, supervised learning techniques are inherently ill-suited for general planning problems. The test dataset was obtained from human driving data which is known to be extremely noisy as well as unpredictable when it comes to motion planning. It would be crucial to determine the successes of this paper based on the state-of-the-art reinforcement learning techniques.<br />
<br />
It would be better if the authors compared their method against other SOTA methods. Also, one of the reasons motion planning is done with interpretable methods rather than black boxes (such as this model) is that it is hard to see where things go wrong with a black box and to fix problems when they occur; this is something the authors should also have discussed.<br />
<br />
A future area of study is to combine other sources of information, such as signals from LiDAR or car side cameras, to build a better prediction model.<br />
<br />
== References ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6] Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.</div>
<hr />
<div>DROCC: '''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections was described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.<br />
<br />
== Previous Work ==<br />
The autonomous vehicle trajectory approaches previously used motion models like Constant Velocity and Constant Acceleration. These models are linear and are only able to handle straight motions. There are curvilinear models such as Constant Turn Rate and Velocity and Constant Turn Rate and Acceleration which handle rotations and more complex motions. Together with these models, Kalman Filter is used to predicting the vehicle trajectory. Kalman filtering is a common technique used in sensor fusion for state estimation that allows the vehicle's state to be predicted while taking into account the uncertainty associated with inputs and measurements. However, the performance of the Kalman Filter in predicting multi-step problem is not that good. Recurrent Neural Network performs significantly better than it. <br />
<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. <br />
<br />
Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Research results indicate that little research has been dedicated on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behaviour at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.<br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection. The following image gives a visual representation of the model.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Data Collection Devices and Outputs ===<br />
<br />
The input of the target perceptions are from the output of the sensors. The data collection in this article uses 6 different sensors with feature fusion to detect traffic in range up to 100m: 1) LiDAR system outputs: Relative position, heading, velocity and box size in local coordinates ; 2) Around0View Monitoring (AVM) and 3)GPS outputs: acquire lanes, road marker, global position; 4) Gateaway engine outputs: precise global position in urban road environment; 5) Micro-Autobox II and 6) a MDPS are used to control and actuate the subject. All data are stored in an industrial PC.<br />
<br />
=== Data ===<br />
Multi-lane turn intersections are the target roads in this paper. The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.<br />
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading, and speed.<br />
<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit. <br />
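The encoder/decoder pair amounts to per-feature z-score normalization and its inverse, with the decoder reusing the encoder's fitted parameters. A minimal sketch (class and method names are illustrative):<br />

```python
import numpy as np

class Encoder:
    """Rescale each feature to mean 0 and standard deviation 1."""
    def fit(self, data):
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)
        return self

    def encode(self, data):
        return (data - self.mean) / self.std

    def decode(self, data):
        # the decoder reuses the encoder's parameters
        # to scale outputs back to actual units
        return data * self.std + self.mean

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
enc = Encoder().fit(X)
Z = enc.encode(X)        # normalized network input
back = enc.decode(Z)     # round trip recovers the original units
```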
==== Sequence length ==== <br />
The sequence length of the RNN input and output is another important factor for improving prediction performance. In this study, sequence lengths of 5, 10, 15, 20, 25, and 30 steps at a 100-millisecond sampling time were compared, and 15 steps showed relatively accurate results even though its observation time is short among the candidates.<br />
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q(x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref}(k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of the travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref}(k|t)</math> consists of the reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta \mu}</math> are the weight matrices for the states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles. <br />
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&u_{min} \leq u(k|t) \leq u_{max} \\
&||u(k+1|t) - u(k|t)|| \leq S
\end{split}<br />
\end{equation*}<br />
where <math>u_{min}</math>, <math>u_{max}</math>, and <math>S</math> are the minimum control input, maximum control input, and maximum slew rate of the input, respectively.<br />
<br />
The position and speed boundaries are determined based on the predicted states:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
where <math>v_{x,limit}</math> is the speed limit for the target vehicle.<br />
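The MPC cost <math>J</math> defined above can be evaluated directly from the predicted states. The sketch below uses a short horizon and hypothetical weights; the paper's tuned weight matrices are not reported in this summary:<br />

```python
import numpy as np

def mpc_cost(x, x_ref, u, Q, R, R_du):
    """J = state-tracking error + input effort + input slew-rate penalty."""
    err = x - x_ref                          # shape (Np, n_states)
    track = np.einsum('ki,ij,kj->', err, Q, err)   # sum of err^T Q err
    effort = R * np.sum(u ** 2)                    # control effort
    slew = R_du * np.sum(np.diff(u) ** 2)          # input-derivative penalty
    return track + effort + slew

# hypothetical horizon Np = 3; states are [travel distance p_x, velocity v_x]
x     = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
x_ref = np.array([[1.0, 5.0], [2.5, 5.0], [3.0, 5.0]])
u     = np.array([0.5, 0.3, 0.1])   # longitudinal acceleration commands
Q, R, R_du = np.eye(2), 1.0, 1.0    # illustrative weights
J = mpc_cost(x, x_ref, u, Q, R, R_du)
```

The constraints on <math>u</math> would be enforced by the MPC solver, not by this cost function.<br />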
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms: a path-following model with constant velocity, a path-following model with traffic flow, and a constant turn rate and velocity (CTRV) model.<br />
<br />
These algorithms are compared according to four types of error: the <math>x</math> position error <math>e_{x,T_p}</math>, <math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math>, where <math>T_p</math> denotes the prediction time. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center><br />
<br />
The proposed model shows significantly smaller prediction errors compared to the base algorithms in terms of mean, standard deviation (STD), and root mean square error (RMSE). Meanwhile, the proposed model exhibits a bell-shaped error distribution with a mean close to zero, which indicates that the proposed algorithm's predictions of human drivers' intentions are relatively precise. Moreover, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within reasonable levels. For instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, the proposed algorithm can be precise and maintain safety simultaneously.<br />
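The reported mean, STD, and RMSE follow directly from the error definitions above; a small sketch with made-up values (not the paper's data):<br />

```python
import numpy as np

def error_stats(actual, predicted):
    """Mean, standard deviation, and RMSE of e = actual - predicted."""
    e = np.asarray(actual) - np.asarray(predicted)
    return e.mean(), e.std(), np.sqrt(np.mean(e ** 2))

# hypothetical y-position values at horizon T_p (metres)
actual    = np.array([0.0, 0.1, -0.2, 0.3])
predicted = np.array([0.1, 0.0, -0.1, 0.1])
mean, std, rmse = error_stats(actual, predicted)
```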
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method mimics a human driver better by simulating a human driver's decision-making process. In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of the target vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data are analyzed from two perspectives: the time to recognize the in-lane target and the similarity to human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base algorithms do. In addition, the proposed algorithm recognized targets later than the base algorithm only when the surrounding target vehicles first appeared beyond the sensors' region-of-interest boundaries. This means that these cases took place sufficiently beyond the safety distance and had little influence on determining the behaviour of the subject vehicle.<br />
<br />
<center>[[Image:Figure11_YanYu.png|500px|]]</center><br />
<br />
In order to compare the similarity between the results from the proposed algorithm and human driving decisions, another type of error is introduced: the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math> and <math>a_{x, cmd}</math> are the human driver's acceleration history and the command from the proposed algorithm, respectively. The proposed algorithm showed results more similar to human drivers' decisions than the base algorithms did: <math>91.97\%</math> of the acceleration error lies within <math>\pm 1\, m/s^2</math>. Moreover, the base algorithms possess a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed model is efficient and safe.<br />
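The 91.97% figure is the share of acceleration errors falling inside <math>\pm 1\, m/s^2</math>, which can be computed as follows (the error values here are illustrative, not the paper's data):<br />

```python
import numpy as np

def fraction_within(errors, bound=1.0):
    """Share of acceleration errors a_x,human - a_x,cmd inside [-bound, bound]."""
    errors = np.asarray(errors)
    return np.mean(np.abs(errors) <= bound)

a_error = np.array([0.2, -0.8, 1.5, 0.4, -0.1])  # hypothetical errors, m/s^2
share = fraction_within(a_error)                 # 4 of 5 lie inside +/- 1
```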
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained using the data captured on urban roads in Seoul and integrated into an MPC-based motion planner. The evaluation results showed precise prediction accuracy, so the algorithm is safe to apply to an autonomous vehicle. The comparison with the three base algorithms (CV/Path, V_flow/Path, and CTRV) also revealed the superiority of the proposed algorithm. In addition, the time to recognize in-lane targets within the intersection improved significantly over the base algorithms. The proposed algorithm was also compared with human driving data and showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unstructured environments, such as multi-lane turn intersections.<br />
<br />
== Future works ==<br />
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2. Applying the machine learning-based approach to infer lane-change intention on motorways and main roads in urban environments.<br />
<br />
3. Extending the target roads of the trajectory predictor, such as to roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive in real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient. It should focus more on LSTMs, RNNs, and studies on different types of roads. Why the LSTM-RNN is used and the background of the method are not stated clearly. There is a lack of conceptual explanation, so it is difficult to distinguish between the LSTM-RNN based motion predictor and the motion planner.<br />
<br />
This is an interesting topic to discuss, and a major one for famous vehicle companies such as Tesla, which already offers a service called Autopilot providing self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers an overall view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be great. It would also be interesting to discuss more applications of this method, such as airplanes and boats.<br />
<br />
Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be an interesting approach to compare the performance against other traditional motion planning algorithms, such as the Kalman filter (KF).<br />
<br />
Some papers have discussed the accuracy of different models for vehicle prediction, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions [https://arxiv.org/abs/1908.00219]. There, the LSTM did not show good performance; the authors increased accuracy by combining the LSTM with an unconstrained model (UM), adding an additional LSTM layer of size 128 that recursively outputs positions instead of simultaneously outputting positions for all horizons.<br />
<br />
It may be better to provide the results of experiments supporting the efficiency of the LSTM-RNN, discuss the predictions on the training and test sets, and compare it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
It would be really helpful if some visualization or a data summary were provided to aid understanding, such as the track of the car's movement.<br />
<br />
The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.<br />
<br />
Understandably, a supervised learning problem should be evaluated on some test dataset. However, supervised learning techniques are inherently ill-suited for general planning problems. The test dataset was obtained from human driving data which is known to be extremely noisy as well as unpredictable when it comes to motion planning. It would be crucial to determine the successes of this paper based on the state-of-the-art reinforcement learning techniques.<br />
<br />
It would be better if the authors compared their method against other SOTA methods. Also, one of the reasons motion planning is done using interpretable methods rather than black boxes (such as this model) is that it is hard to see where things go wrong and fix problems with a black box when they occur; this is something the authors should also have discussed.<br />
<br />
A future area of study is to combine other sources of information, such as signals from lidar or side cameras, to build a better prediction model.<br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] H. Cui, T. Nguyen, F.-C. Chou, T.-H. Lin, J. Schneider, D. Bradley, and N. Djuric, “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions,” 2019, [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6] J. Schulz, C. Hubmann, N. Morin, J. Löchner, and D. Burschka, “Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios,” 2019, doi: 10.1109/IVS.2019.8814080.</div>
<hr />
<div>'''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions at multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to improved prediction ability.<br />
<br />
== Previous Work ==<br />
Previous autonomous vehicle trajectory approaches used motion models like Constant Velocity and Constant Acceleration. These models are linear and are only able to handle straight motions. There are also curvilinear models, such as Constant Turn Rate and Velocity and Constant Turn Rate and Acceleration, which handle rotations and more complex motions. Together with these models, a Kalman filter is used to predict the vehicle trajectory. Kalman filtering is a common technique used in sensor fusion for state estimation that allows the vehicle's state to be predicted while taking into account the uncertainty associated with inputs and measurements. However, the performance of the Kalman filter on multi-step prediction problems is poor; a Recurrent Neural Network performs significantly better. <br />
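For comparison, the prediction step of a Kalman filter under a constant-velocity model can be sketched as follows (a minimal one-dimensional example; the noise covariance is illustrative). Note how the covariance, i.e. the uncertainty, grows over a 15-step prediction, which reflects the multi-step weakness noted above:<br />

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Kalman filter prediction step: x' = F x, P' = F P F^T + Q."""
    x_new = F @ x
    P_new = F @ P @ F.T + Q
    return x_new, P_new

dt = 0.1                             # 100 ms step, as in the paper
F = np.array([[1.0, dt],             # constant-velocity motion model
              [0.0, 1.0]])
Q = 0.01 * np.eye(2)                 # process noise (illustrative value)
x = np.array([0.0, 10.0])            # state: [position (m), velocity (m/s)]
P = np.eye(2)                        # initial state covariance
for _ in range(15):                  # 15-step open-loop prediction
    x, P = kf_predict(x, P, F, Q)
# position advances by v*dt per step, while P (uncertainty) keeps growing
```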
<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories of motion prediction model: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, considering only the states of the predicted vehicles kinematically. Their advantage is the minimal computational burden among the three types; however, they cannot consider the interactions between vehicles. Maneuver-based models consider drivers’ intentions and classify them. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is stated to improve prediction accuracy. However, such models still mainly serve to assist physics-based models. <br />
<br />
A Recurrent Neural Network (RNN) is the type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of detected vehicles simultaneously as a scene. However, the prediction algorithm is computationally more complex, so such models are often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Research results indicate that little work has been dedicated to predicting vehicle trajectories at intersections. Moreover, public data sets for analyzing driver behaviour at intersections are insufficient, and such data are not easy to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection. It is therefore necessary to design a motion predictor that can be used in real-time traffic.<br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The figure below depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the states of the surrounding vehicles and is used to determine the expected longitudinal acceleration in actual traffic at the intersection.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Data Collection Devices and Outputs ===<br />
<br />
<br />
=== Data ===<br />
Multi-lane turn intersections are the target roads in this paper. The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.<br />
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading, and speed.<br />
<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit. <br />
==== Sequence length ==== <br />
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.<br />
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles. <br />
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\<br />
&||\mu(k+1|t) - \mu(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
Where <math>u_{min}</math>, <math>u_{max}</math>and S are the minimum/maximum control input and maximum slew rate of input respectively.<br />
<br />
Determine the position and speed boundary based on the predicted state:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.<br />
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms, a path-following model with <br />
constant velocity, a path-following model with traffic flow and a CTRV model.<br />
<br />
We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>, <br />
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center><br />
<br />
The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean, <br />
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped <br />
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers' <br />
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within <br />
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, <br />
the proposed algorithm can be precise and maintain safety simultaneously.<br />
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method mimics a human driver better, by simulating a human driver's decision-making process. <br />
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target <br />
the vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to <br />
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based <br />
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when <br />
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means <br />
that these cases took place sufficiently beyond the safety distance, and had little influence on determining <br />
the behaviour of the subject vehicle.<br />
<br />
<center>[[Image:Figure11_YanYu.png|500px|]]</center><br />
<br />
In order to compare the similarities between the results form the proposed algorithm and human driving decisions, <br />
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math><br />
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm, <br />
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base <br />
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm <br />
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed <br />
model is efficient and safe.<br />
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm. The evaluation results showed precise prediction accuracy. In addition, the time-to-recognize in-lane targets within the intersection improved significantly over the performance of the base algorithms. The proposed algorithm was compared with human driving data, and it showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unconstructed environments, such as multi-lane turn intersections.<br />
<br />
== Future works ==<br />
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.<br />
<br />
3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.<br />
<br />
This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.<br />
<br />
Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.<br />
<br />
There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.<br />
<br />
It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
It would be really helpful if some visualization or a data summary could be provided to aid understanding of the content, such as the track of the car's movement.<br />
<br />
The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.<br />
<br />
Understandably, a supervised learning problem should be evaluated on some test dataset. However, supervised learning techniques are inherently ill-suited for general planning problems. The test dataset was obtained from human driving data which is known to be extremely noisy as well as unpredictable when it comes to motion planning. It would be crucial to determine the successes of this paper based on the state-of-the-art reinforcement learning techniques.<br />
<br />
It would be better if the authors compared their method against other SOTA methods. Also, one of the reasons motion planning is done using interpretable methods rather than black boxes (such as this model) is that it is hard to see where things go wrong and to fix problems with a black box when they occur; this is something the authors should also have discussed.<br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6] J. Schulz, C. Hubmann, N. Morin, J. Löchner, and D. Burschka, “Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios,” in Proc. IEEE Intell. Veh. Symp. (IV), 2019, doi: 10.1109/IVS.2019.8814080.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=43520Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-09T01:41:34Z<p>Z42qin: </p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake, in the area of scientific machine learning. A deep learning approach was used due to the model’s ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques such as CNNs, SVMs, K-nearest neighbours, naïve Bayes classification, and ANNs. From these instances, the paper observes several faults in the previous papers. The first is that most papers apply feature selection to the raw ECG data before training the model. Dabanloo and Attarodi [30] used various techniques such as ANNs, K-nearest neighbours, and naïve Bayes, but extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbours as their classifiers, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection pick ECG leads for classification arbitrarily, without rationale. For example, Liu et al. [23] used a deep CNN that takes 3 seconds of ECG signal from lead II at a time as input; the decision to use lead II rather than the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical for large volumes of data. The issue with arbitrary selection of leads is that it offers no insight into why a lead was chosen or into the contribution of each lead to identifying heart disease. Thus, this paper addresses these two issues by implementing a deep learning model that does not rely on feature selection of ECG data and by quantifying the contribution of each ECG and Frank lead to identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second-long ECG signal. There are 8 hidden layers in this model, each consisting of a 1D convolution layer with the ReLU activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the sigmoid activation function.<br />
<br />
This model is trained using batches of size 10 with the ADAM optimizer and a learning rate of <math>10^{-4}</math>. For training, the dataset is split into train, validation, and test sets in an 80-10-10 ratio.<br />
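As a sketch, the forward pass of such a network can be written as follows. This is a schematic NumPy illustration only: the kernel size, stride, number of filters, input sampling rate, and the final averaging step are assumptions made for illustration, not the paper's exact hyperparameters.<br />

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d(x, w, b, stride=1):
    """x: (c_in, length), w: (c_out, c_in, k), b: (c_out,)."""
    c_out, c_in, k = w.shape
    out_len = (x.shape[1] - k) // stride + 1
    out = np.zeros((c_out, out_len))
    for i in range(out_len):
        window = x[:, i * stride : i * stride + k]
        out[:, i] = np.tensordot(w, window, axes=([1, 2], [0, 1])) + b
    return out

def batch_norm(x, eps=1e-5):
    """Per-channel normalization (single-sample stand-in for 1D batch norm)."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, layers):
    """Eight hidden blocks of conv -> ReLU -> batch norm, then a
    sigmoid output unit, mirroring the architecture described above."""
    for w, b in layers:
        x = batch_norm(relu(conv1d(x, w, b, stride=2)))
    return sigmoid(x.mean())  # collapse to a single probability (assumed)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 1000))   # e.g. two leads, 10 s at an assumed 100 Hz
layers = [(0.1 * rng.standard_normal((8, 2 if i == 0 else 8, 3)), np.zeros(8))
          for i in range(8)]
p = forward(x, layers)               # a probability in (0, 1)
```

Training with ADAM, batches of 10, and the 80-10-10 split would be layered on top of a forward pass like this in a deep learning framework.<br />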
<br />
==Result== <br />
<br />
The paper first quantifies the accuracies of single channels using 20-fold cross-validation; the five channels with the highest individual accuracies are v5, v6, vx, vz, and ii. The researchers then investigated the accuracies of pairs drawn from these top five channels using 20-fold cross-validation, concluding that the highest-accuracy pair to feed into the neural network is lead v6 with lead vz. They then ran 100-fold cross-validation on the v6 and vz pair of channels and compared outliers among the top 20, top 50, and all 100 models, finding that the standard deviation is non-trivial and that a few models performed very poorly. <br />
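The pair-selection step can be sketched as follows. The `cv_accuracy` scorer and all numbers below are illustrative stand-ins (they are not the paper's reported accuracies); the scorer plays the role of the 20-fold cross-validated network.<br />

```python
from itertools import combinations

def best_pair(channels, cv_accuracy):
    """Score every pair of candidate channels with the supplied
    cross-validation scorer; return the best pair plus all scores."""
    scores = {pair: cv_accuracy(pair) for pair in combinations(channels, 2)}
    return max(scores, key=scores.get), scores

# Illustrative stand-in scores, arranged so that (v6, vz) wins.
fake_scores = {('v5', 'v6'): 0.96, ('v5', 'vx'): 0.94, ('v5', 'vz'): 0.95,
               ('v5', 'ii'): 0.93, ('v6', 'vx'): 0.95, ('v6', 'vz'): 0.99,
               ('v6', 'ii'): 0.96, ('vx', 'vz'): 0.95, ('vx', 'ii'): 0.92,
               ('vz', 'ii'): 0.94}
pair, scores = best_pair(['v5', 'v6', 'vx', 'vz', 'ii'],
                         lambda p: fake_scores[p])
```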
<br />
Next, they discussed two factors affecting the evaluation of model performance: 1) the random train-validation-test split may affect the measured performance of the model, which could be mitigated with access to a larger data set and further study; and 2) random initialization of the neural network's weights has little effect on the evaluation, since results averaged over a fixed train-validation-test split remain high.<br />
<br />
Compared with the models in 12 other papers, the model in this article has the highest accuracy, specificity, and precision. To address the concern that multiple records per patient might inflate the training accuracy, they also used a 290-fold patient-wise split, which identified the same best pair (v6 and vz) as the record-wise split. Although the patient-wise split can yield a lower accuracy evaluation, it still maintains a high average of 97.83%. <br />
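The record-wise versus patient-wise distinction can be sketched as follows (made-up record and patient IDs; the point is only that a patient-wise split keeps every record from a given patient on one side of the train/test boundary):<br />

```python
def patient_wise_split(records, test_patients):
    """records: list of (record_id, patient_id) pairs. All records of a
    patient land on the same side of the split."""
    train = [r for r in records if r[1] not in test_patients]
    test = [r for r in records if r[1] in test_patients]
    return train, test

# Hypothetical records: some patients contribute several ECGs,
# as in the PTB database (549 records from 290 patients).
records = [('r1', 'p1'), ('r2', 'p1'), ('r3', 'p2'),
           ('r4', 'p3'), ('r5', 'p3'), ('r6', 'p4')]
train, test = patient_wise_split(records, {'p3'})
```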
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-the-art model by a margin of 1 percent. This study finds that, out of the 15 ECG channels (12 conventional ECG leads and 3 Frank leads), channels v6, vz, and ii contain the most meaningful information for detecting myocardial infarction. It also shows that recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infarction with a cardiologist-level success rate. To further improve the performance of such models, access to larger labelled data sets is needed: the PTB database is small, so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infarction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the 15 standard ECG channels indicates that deep learning can identify myocardial infarction by processing only ten seconds of raw ECG data from the v6, vz, and ii leads, reaching a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. A neural network model originally designed to identify earthquakes can be re-designed and tuned to identify myocardial infarction, and feature engineering of the ECG data is not required to identify myocardial infarction in the PTB database: the model required only ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to larger databases should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=43502stat441F212020-11-08T23:44:21Z<p>Z42qin: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="250pt"|Name <br />
|width="15pt"|Paper number <br />
|width="700pt"|Title<br />
|width="15pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||<br />
[https://www.youtube.com/watch?v=TVLpSFYgF0c&feature=youtu.be]<br />
|-<br />
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf] paper || ||<br />
|-<br />
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin Summary] ||<br />
|-<br />
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| Neural Ordinary Differential Equations || [https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf] || ||<br />
|-<br />
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| Adversarial Attacks on Copyright Detection Systems || Paper [https://proceedings.icml.cc/static/paper_files/icml/2020/1894-Paper.pdf] || ||<br />
|-<br />
|Week of Nov 16 || Casey De Vera, Solaiman Jawad, Jihoon Han || 7|| || || ||<br />
|-<br />
|Week of Nov 16 || Yuxin Wang, Evan Peters, Cynthia Mou, Sangeeth Kalaichanthiran || 8|| Uniform convergence may be unable to explain generalization in deep learning || [https://papers.nips.cc/paper/9336-uniform-convergence-may-be-unable-to-explain-generalization-in-deep-learning.pdf] || ||<br />
|-<br />
|Week of Nov 16 || Yuchuan Wu || 9|| || || ||<br />
|-<br />
|Week of Nov 16 || Zhou Zeping, Siqi Li, Yuqin Fang, Fu Rao || 10|| The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network || [http://people.cs.uchicago.edu/~pworah/rmt2.pdf] || ||<br />
|-<br />
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Networks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Taohao Wang, Zeren Shen, Zihao Guo, Rui Chen || 13|| Deep multiple instance learning for image classification and auto-annotation || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Wu_Deep_Multiple_Instance_2015_CVPR_paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data Summary] ||<br />
|-<br />
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| Semantic Relation Classification via Convolution Neural Network|| [https://www.aclweb.org/anthology/S18-1127.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| Graph Structure of Neural Networks || [https://proceedings.icml.cc/paper/2020/file/757b505cfd34c64c85ca5b5690ee5293-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN Summary]||<br />
|-<br />
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr] || ||<br />
|-<br />
|Week of Nov 30 || Banno Dion, Battista Joseph, Kahn Solomon || 21|| Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks || [https://www.sciencedirect.com/science/article/pii/S1877050919310646] || ||<br />
|-<br />
|Week of Nov 30 || Sai Arvind Budaraju, Isaac Ellmen, Dorsa Mohammadrezaei, Emilee Carson || 22|| A universal SNP and small-indel variant caller using deep neural networks||[https://www.nature.com/articles/nbt.4235.epdf?author_access_token=q4ZmzqvvcGBqTuKyKgYrQ9RgN0jAjWel9jnR3ZoTv0NuM3saQzpZk8yexjfPUhdFj4zyaA4Yvq0LWBoCYQ4B9vqPuv8e2HHy4vShDgEs8YxI_hLs9ov6Y1f_4fyS7kGZ Paper] || ||<br />
|-<br />
|Week of Nov 30 || Daniel Fagan, Cooper Brooke, Maya Perelman || 23|| Efficient kNN Classification With Different Number of Nearest Neighbors || [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7898482 Paper] || ||<br />
|-<br />
|Week of Nov 30 || Karam Abuaisha, Evan Li, Jason Pu, Nicholas Vadivelu || 24|| Being Bayesian about Categorical Probability || [https://proceedings.icml.cc/static/paper_files/icml/2020/3560-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Anas Mahdi Will Thibault Jan Lau Jiwon Yang || 25|| Loss Function Search for Face Recognition<br />
|| [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf] paper || ||<br />
|-<br />
|Week of Nov 30 ||Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee || 26|| Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms || [https://arxiv.org/pdf/1912.07618.pdf?fbclid=IwAR0RwATSn4CiT3qD9LuywYAbJVw8YB3nbex8Kl19OCExIa4jzWaUut3oVB0 Paper] || Summary [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&fbclid=IwAR1Tad2DAM7LT6NXXuSYDZtHHBvN0mjZtDdCOiUFFq_XwVcQxG3hU-3XcaE] ||<br />
|-<br />
|Week of Nov 30 || Stan Lee, Seokho Lim, Kyle Jung, Daehyun Kim || 27|| Bag of Tricks for Efficient Text Classification || [https://arxiv.org/pdf/1607.01759.pdf paper] || ||<br />
|-<br />
|Week of Nov 30 || Yawen Wang, Danmeng Cui, ZiJie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Bertrand Sodjahin, Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classifcation systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || ||<br />
|-<br />
|Week of Nov 30 || Daniel Zhang, Jacky Yao, Scholar Sun, Russell Parco, Ian Cheung || 31 || Speech2Face: Learning the Face Behind a Voice || [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform Paper] || ||<br />
|-<br />
|Week of Nov 30 || Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du || 32 || Evaluating Machine Accuracy on ImageNet || [https://proceedings.icml.cc/static/paper_files/icml/2020/6173-Paper.pdf] || ||<br />
|-<br />
|Week of Nov 30 || Msuhi Wang, Siyuan Qiu, Yan Yu || 33 || Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections || [https://ieeexplore.ieee.org/abstract/document/8957421 paper] || ||</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=42830Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-10-26T00:26:32Z<p>Z42qin: </p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Zhao, Amartya (Marty) Mukherjee<br />
<br />
== Maggie ==<br />
<br />
=== Introduction ===<br />
<br />
Problem: To detect risk of heart disease from ECG signals.<br />
<br />
Importance: To provide a correct diagnosis so that proper health care can be given to patients.<br />
<br />
Area: Deep learning – scientific machine learning<br />
<br />
* Concerned about the design, training, and use of ML algorithms in an optimal manner for a given problem.<br />
<br />
Deep learning model benefits:<br />
<br />
* Uses multiple GPUs to construct complicated NNs that can be trained using TB-sized datasets, which are robust against noise.<br />
<br />
* After training is complete, it takes little computational power to conduct statistical inference.<br />
<br />
Questions that need to be answered:<br />
<br />
* What AI architecture and datasets can provide evidence of symptoms, ECG changes, and imaging evidence showing loss of viable myocardium and wall motion abnormality?<br />
<br />
Purpose of the article:<br />
<br />
* To provide a detailed analysis of the contribution of each ECG lead to identifying heart disease in the model, because data selection in previous studies of heart disease was arbitrary.<br />
<br />
* To show that using multiple channels of information enhances prediction accuracy in deep learning, i.e., processing the top three ECG leads simultaneously in the neural network.<br />
<br />
* To show that feature engineering is not necessary in the training, validation, or testing process for ECG data in neural networks.<br />
<br />
== Related Work ==<br />
<br />
Database used: PTB database<br />
<br />
1. CNN Network<br />
<br />
a. Used both noisy and denoised ECGs without feature engineering.<br />
<br />
2. Used artificial neural network, probabilistic neural network, KNN, multi-layer perceptron, and Naïve Bayes Classification<br />
<br />
a. Extracted two features: T-wave integral and total integral to identify heart disease.<br />
<br />
3. Developed two different ANNs: RBF and MLP<br />
<br />
4. Supervised learning techniques had limited success on the problem, so multiple-instance learning was used.<br />
<br />
a. Demonstrated that the proposed algorithm, LTMIL, surpasses supervised approaches.<br />
<br />
5. Created a new feature by approximating the ECG signal with a 20th-order polynomial, which achieved 94.4% accuracy.<br />
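<br />
A sketch of the polynomial-feature idea in item 5: fit a 20th-order polynomial to an ECG segment and use its coefficients as a fixed-length feature vector. The QRS-like synthetic signal is a placeholder, and the segment is rescaled to [-1, 1] to keep the high-order fit numerically stable.<br />

```python
import numpy as np

# Synthetic stand-in for one heartbeat's ECG samples, placed on [-1, 1]
# so that the high-order Vandermonde system stays well conditioned.
x = np.linspace(-1, 1, 200)
signal = np.exp(-x**2 / 0.05)  # a single QRS-like peak

# Approximate the segment with a 20th-order polynomial; the 21
# coefficients become a fixed-length feature vector for a classifier.
coeffs = np.polyfit(x, signal, deg=20)
print(coeffs.shape)  # (21,)

# Sanity check: evaluate the polynomial back on the sample grid.
max_err = np.abs(np.polyval(coeffs, x) - signal).max()
```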
<br />
6. Stationary wavelet transforms to decompose ECG into sub-bands.<br />
<br />
a. SVM and KNN used to classify.<br />
<br />
b. Features used: sample entropy, normalized sub-bands, log energy entropy, median slope from sub-bands.<br />
<br />
7. Transfer learning – took a deep CNN model built for arrhythmia detection and developed it to detect heart disease.<br />
<br />
8. Simple adaptive threshold (SAT)<br />
<br />
a. Multiresolution approach, adaptive thresholding used to extract features: depth of Q peak and elevation in ST segment<br />
<br />
9. Subject-oriented approach using CNN to take in leads II, III, AVF<br />
<br />
10. Modeled the ECG using a 2nd-order ODE and fed the best-fitting coefficients of the ECG signal into an SVM.<br />
<br />
11. Multi-channel CNN (16 layers) with long short-term memory (LSTM) units<br />
<br />
12. Deep CNN that takes 3 seconds at a time of lead II as input<br />
<br />
Most use feature extraction/selection from the raw ECG data before training.<br />
<br />
o Problem with feature selection is that it is not practical for large volumes of data.<br />
<br />
Other papers that do not use feature selection pick ECG leads for classification arbitrarily and do not provide a rationale.<br />
<br />
==Results== <br />
<br />
1. Quantified accuracies for single channels with 20-fold cross-validation; the highest individual accuracies came from leads v5, v6, vx, vz, and ii.<br />
<br />
2. Quantified accuracies for pairs drawn from the top 5 individual channels with 20-fold cross-validation; the pair with the highest accuracy when fed into the neural network was lead v6 with lead vz.<br />
<br />
3. Used 100-fold cross-validation on the v6–vz channel pair, then compared outliers among the top 20, top 50, and all 100 models, finding that the standard deviation is non-trivial and that a few models performed very poorly.<br />
<br />
4. Discussed 2 factors affecting model performance evaluation:<br />
<br />
1) The random train-val-test split can affect model performance; this could be improved with access to a larger data set and warrants further discussion.<br />
<br />
2) Random initialization of the neural network's weights has little effect on performance evaluation, since results with a fixed train-val-test split show a high average.<br />
<br />
5. Compared with the models in 12 other papers, the model in this article has the highest accuracy, specificity, and precision.<br />
<br />
6. Further used a 290-fold patient-wise split, which yields the same best-performing pair (v6 and vz) as the record-wise split.<br />
<br />
1) Discussed that a patient-wise split may produce lower accuracy estimates; however, accuracy still maintains an average of 97.83%.<br />
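<br />
The record-wise vs. patient-wise distinction in item 6 can be illustrated directly: a record-wise split shuffles records and may place two records from the same patient on both sides of the train/test boundary, while a patient-wise split holds out whole patients. The record and patient IDs below are made up for illustration.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 records belonging to 5 patients (one patient ID per record).
patients = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
records = np.arange(10)

# Record-wise split: shuffle records and ignore patient identity;
# the two records of one patient may land on different sides.
record_test = rng.permutation(10)[:2]

# Patient-wise split: hold out every record of the chosen patients.
test_patients = {4}
patient_test = records[np.isin(patients, list(test_patients))]

# Patient-wise splitting guarantees no patient appears in both sets.
train_patients = set(patients[~np.isin(patients, list(test_patients))].tolist())
print(train_patients & test_patients)  # set()
```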
<br />
== Data and Model ==<br />
<br />
3.1: Data curation<br />
<br />
Dataset: 549 ECG records total<br />
<br />
290 unique patients<br />
<br />
Each ECG record has a mean length of over 100s<br />
<br />
3.2: ANN model<br />
<br />
ConvNetQuake model + 1D batch normalization + Label-smoothing<br />
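<br />
Label smoothing, mentioned above, replaces hard 0/1 targets with slightly softened values so the sigmoid output is never pushed to saturation. A minimal binary version; the smoothing factor 0.1 is an illustrative choice, not a value from the paper.<br />

```python
import numpy as np

def smooth_labels(y, eps=0.1):
    """Map hard binary labels {0, 1} to {eps/2, 1 - eps/2}."""
    y = np.asarray(y, dtype=float)
    return y * (1.0 - eps) + eps / 2.0

hard = np.array([0, 1, 1, 0])
print(smooth_labels(hard))  # [0.05 0.95 0.95 0.05]
```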
<br />
Model (PyTorch):<br />
<br />
- Input layer: 10-second long ECG signal<br />
<br />
- Hidden layers: 8 * (1D convolution layer, Activation function: ReLU, 1D batch normalization layer)<br />
<br />
- Output layer: 1280 dimensions -> 1 dimension, Activation function: Sigmoid<br />
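<br />
The 1280-dimensional flattened output is consistent with a ConvNetQuake-style stack under plausible assumptions: 10 s of ECG at the PTB sampling rate of 1000 Hz, 8 stride-2 convolutions with 'same' padding, and 32 filters per layer. The stride, padding, and filter count are assumptions for this back-of-the-envelope check, not values stated above.<br />

```python
import math

samples = 10 * 1000    # 10 s of ECG at an assumed 1000 Hz
filters = 32           # assumed filters in the last conv layer

# Each stride-2 convolution with 'same' padding halves the length
# (rounding up), so 8 layers shrink 10000 samples to 40 time steps.
length = samples
for _ in range(8):
    length = math.ceil(length / 2)

print(length, length * filters)  # 40 1280
```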
<br />
Batch size = 10<br />
<br />
Learning rate = 10^-4<br />
<br />
Optimizer = Adam<br />
<br />
80-10-10: Train-Validation-Test</div>
<hr />
<div><br />
== Maggie ==<br />
<br />
Introduction<br />
<br />
Problem: To detect risk of heart disease from ECG signals.<br />
<br />
Importance: To provide a correct diagnosis so that proper health care can be given to patients.<br />
<br />
Area: Deep learning – scientific machine learning<br />
<br />
o Concerned about design, training, and use of ML algorithm in an optimal matter towards a certain problem.<br />
<br />
Deep learning model benefits:<br />
<br />
o Uses multiple GPUs to construct complicated NN that can be trained using TB-sized datasets, which are robust against noise.<br />
<br />
o After training is complete, it takes little computational power to conduct statistical inference.<br />
<br />
Question that need to be answered:<br />
<br />
o What AI architecture and datasets can provide evidence of symptoms, ECG changes, imaging evidence that shows loss of viable myocardium and wall motion abnormality?<br />
<br />
Purpose of the article:<br />
<br />
o To provide a detailed analysis on the contribution of each ECG lead in identifying heart disease in the model. This is because the selection of data in previous studies regarding heart disease made data selection arbitrary in identifying heart conditions.<br />
<br />
o To show the use of using multiple data channels of information to enhance prediction accuracy in deep learning, i.e. processing the top three ECG leads simultaneously in the neural network.<br />
<br />
o Show that feature engineering is not necessary in the training, validation, or testing process for ECG data in neural networks.<br />
<br />
Related work<br />
<br />
Database used: PTB database<br />
<br />
1. CNN Network<br />
<br />
a. Used both noisy and denoised ECGs without feature engineering.<br />
<br />
2. Used artificial neural network, probabilistic neural network, KNN, multi-layer perceptron, and Naïve Bayes Classification<br />
<br />
a. Extracted two features: T-wave integral and total integral to identify heart disease.<br />
<br />
3. Developed two different ANN: RBF and MLP<br />
<br />
4. Supervised learning techniques have limited success to the problem and used multiple instance learning.<br />
<br />
a. Demonstrate proposed algorithm LTMIL surpasses supervised approaches.<br />
<br />
5. Create new feature by approximating ECG signal using a 20th order polynomial, which achieved 94.4% accuracy.<br />
<br />
6. Stationary wavelet transforms to decompose ECG into sub-bands.<br />
<br />
a. SVM and KNN used to classify.<br />
<br />
b. Features used: sample entropy, normalized sub-bands, log energy entropy, median slope from sub-bands.<br />
<br />
7. Transfer learning – used deep CNN model for arrhythmia and developed it to detect heart disease.<br />
<br />
8. Simple adaptive threshold (SAT)<br />
<br />
a. Multiresolution approach, adaptive thresholding used to extract features: depth of Q peak and elevation in ST segment<br />
<br />
9. Subject-oriented approach using CNN to take in leads II, III, AVF<br />
<br />
10. Model ECG using 2nd order ODE and feed the best-fitting coefficients of the ECG signal into a SVM.<br />
<br />
11. Multi-channel CNN (16 layers) with long-short memory units<br />
<br />
12. Deep CNN that takes 3 seconds at a time of lead II as input<br />
<br />
Most use feature extraction/selection from the raw ECG data before training.<br />
<br />
o Problem with feature selection is that it is not practical for large volumes of data.<br />
<br />
Other papers that do not use feature selection arbitrarily picks ECG leads for classification and does not provide rationale.<br />
<br />
==Betty==<br />
<br />
Result: <br />
<br />
1. Quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii<br />
<br />
2. Quantification of accuracies for pairs of top 5 highest individual channels with 20-fold cross-validation, resulting highest pairs accuracies to fed into a the neural network: lead v6 and lead vz <br />
<br />
3. Use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.<br />
<br />
4. Discussing 2 factors effecting model performance evaluation:<br />
<br />
1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion <br />
<br />
2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split <br />
<br />
5. Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision<br />
<br />
6. Further using 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz as record-wise split <br />
<br />
1） Discuss patient-wise split might result lower accuracy evaluation, however, it still maintain an average of 97.83%<br />
<br />
== Marty ==<br />
<br />
3.1: Data curation<br />
<br />
Dataset: 549 ECG records total<br />
<br />
290 unique patients<br />
<br />
Each ECG record has a mean length of over 100s<br />
<br />
3.2: ANN model<br />
<br />
ConvNetQuake model + 1D batch normalization + Label-smoothing<br />
<br />
Model (PyTorch):<br />
<br />
- Input layer: 10-second long ECG signal<br />
<br />
- Hidden layers: 8 * (1D convolution layer, Activation function: RELU, 1D batch normalization layer)<br />
<br />
- Output layer: 1280 dimensions -> 1 dimension, Activation function: Sigmoid<br />
<br />
Batch size = 10<br />
<br />
Learning rate = 10^-4<br />
<br />
Optimizer = ADAM<br />
<br />
80-10-10: Train-Validation-Test</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=42827Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-10-26T00:23:37Z<p>Z42qin: /* Maggie */</p>
<hr />
<div><br />
== Maggie ==<br />
<br />
Introduction<br />
<br />
Problem: To detect risk of heart disease from ECG signals.<br />
<br />
Importance: To provide a correct diagnosis so that proper health care can be given to patients.<br />
<br />
Area: Deep learning – scientific machine learning<br />
<br />
o Concerned about design, training, and use of ML algorithm in an optimal matter towards a certain problem.<br />
<br />
Deep learning model benefits:<br />
<br />
o Uses multiple GPUs to construct complicated NN that can be trained using TB-sized datasets, which are robust against noise.<br />
<br />
o After training is complete, it takes little computational power to conduct statistical inference.<br />
<br />
Question that need to be answered:<br />
<br />
o What AI architecture and datasets can provide evidence of symptoms, ECG changes, imaging evidence that shows loss of viable myocardium and wall motion abnormality?<br />
<br />
Purpose of the article:<br />
<br />
o To provide a detailed analysis on the contribution of each ECG lead in identifying heart disease in the model. This is because the selection of data in previous studies regarding heart disease made data selection arbitrary in identifying heart conditions.<br />
<br />
o To show the use of using multiple data channels of information to enhance prediction accuracy in deep learning, i.e. processing the top three ECG leads simultaneously in the neural network.<br />
<br />
o Show that feature engineering is not necessary in the training, validation, or testing process for ECG data in neural networks.<br />
<br />
Related work<br />
<br />
Database used: PTB database<br />
<br />
1. CNN Network<br />
<br />
a. Used both noisy and denoised ECGs without feature engineering.<br />
<br />
2. Used artificial neural network, probabilistic neural network, KNN, multi-layer perceptron, and Naïve Bayes Classification<br />
<br />
a. Extracted two features: T-wave integral and total integral to identify heart disease.<br />
<br />
3. Developed two different ANN: RBF and MLP<br />
<br />
4. Supervised learning techniques have limited success to the problem and used multiple instance learning.<br />
<br />
a. Demonstrate proposed algorithm LTMIL surpasses supervised approaches.<br />
<br />
5. Create new feature by approximating ECG signal using a 20th order polynomial, which achieved 94.4% accuracy.<br />
<br />
6. Stationary wavelet transforms to decompose ECG into sub-bands.<br />
<br />
a. SVM and KNN used to classify.<br />
<br />
b. Features used: sample entropy, normalized sub-bands, log energy entropy, median slope from sub-bands.<br />
<br />
7. Transfer learning – used deep CNN model for arrhythmia and developed it to detect heart disease.<br />
<br />
8. Simple adaptive threshold (SAT)<br />
<br />
a. Multiresolution approach, adaptive thresholding used to extract features: depth of Q peak and elevation in ST segment<br />
<br />
9. Subject-oriented approach using CNN to take in leads II, III, AVF<br />
<br />
10. Model ECG using 2nd order ODE and feed the best-fitting coefficients of the ECG signal into a SVM.<br />
<br />
11. Multi-channel CNN (16 layers) with long-short memory units<br />
<br />
12. Deep CNN that takes 3 seconds at a time of lead II as input<br />
<br />
Most use feature extraction/selection from the raw ECG data before training.<br />
<br />
o Problem with feature selection is that it is not practical for large volumes of data.<br />
<br />
Other papers that do not use feature selection arbitrarily picks ECG leads for classification and does not provide rationale.<br />
<br />
==Betty==<br />
<br />
Result: <br />
<br />
1. Quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii<br />
<br />
2. Quantification of accuracies for pairs of top 5 highest individual channels with 20-fold cross-validation, resulting highest pairs accuracies to fed into a the neural network: lead v6 and lead vz <br />
<br />
3. Use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.<br />
<br />
4. Discussing 2 factors effecting model performance evaluation:<br />
<br />
1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion <br />
<br />
2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split <br />
<br />
5. Comparing with other models, the model in this article has the highest accuracy, specificity, and precision<br />
<br />
6. Further using 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz as record-wise split <br />
<br />
1） Discuss patient-wise split might result lower accuracy evaluation, however, it still maintain an average of 97.83%<br />
<br />
== Marty ==<br />
<br />
3.1: Data curation<br />
<br />
Dataset: 549 ECG records total<br />
<br />
290 unique patients<br />
<br />
Each ECG record has a mean length of over 100s<br />
<br />
3.2: ANN model<br />
<br />
ConvNetQuake model + 1D batch normalization + Label-smoothing<br />
<br />
Model (PyTorch):<br />
<br />
- Input layer: 10-second long ECG signal<br />
<br />
- Hidden layers: 8 * (1D convolution layer, Activation function: RELU, 1D batch normalization layer)<br />
<br />
- Output layer: 1280 dimensions -> 1 dimension, Activation function: Sigmoid<br />
<br />
Batch size = 10<br />
<br />
Learning rate = 10^-4<br />
<br />
Optimizer = ADAM<br />
<br />
80-10-10: Train-Validation-Test</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=42826Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-10-26T00:23:17Z<p>Z42qin: /* Maggie */</p>
<hr />
<div><br />
== Maggie ==<br />
<br />
Introduction<br />
<br />
Problem: To detect risk of heart disease from ECG signals.<br />
<br />
Importance: To provide a correct diagnosis so that proper health care can be given to patients.<br />
<br />
Area: Deep learning – scientific machine learning<br />
<br />
o Concerned about design, training, and use of ML algorithm in an optimal matter towards a certain problem.<br />
<br />
Deep learning model benefits:<br />
<br />
o Uses multiple GPUs to construct complicated NN that can be trained using TB-sized datasets, which are robust against noise.<br />
<br />
o After training is complete, it takes little computational power to conduct statistical inference.<br />
<br />
Question that need to be answered:<br />
<br />
o What AI architecture and datasets can provide evidence of symptoms, ECG changes, imaging evidence that shows loss of viable myocardium and wall motion abnormality?<br />
<br />
Purpose of the article:<br />
<br />
o To provide a detailed analysis on the contribution of each ECG lead in identifying heart disease in the model. This is because the selection of data in previous studies regarding heart disease made data selection arbitrary in identifying heart conditions.<br />
<br />
o To show the use of using multiple data channels of information to enhance prediction accuracy in deep learning, i.e. processing the top three ECG leads simultaneously in the neural network.<br />
<br />
o Show that feature engineering is not necessary in the training, validation, or testing process for ECG data in neural networks.<br />
<br />
Related work<br />
<br />
Database used: PTB database<br />
<br />
1. CNN Network<br />
<br />
a. Used both noisy and denoised ECGs without feature engineering.<br />
<br />
2. Used artificial neural network, probabilistic neural network, KNN, multi-layer perceptron, and Naïve Bayes Classification<br />
<br />
a. Extracted two features: T-wave integral and total integral to identify heart disease.<br />
<br />
3. Developed two different ANN: RBF and MLP<br />
<br />
4. Supervised learning techniques have limited success to the problem and used multiple instance learning.<br />
<br />
a. Demonstrate proposed algorithm LTMIL surpasses supervised approaches.<br />
<br />
5. Create new feature by approximating ECG signal using a 20th order polynomial, which achieved 94.4% accuracy.<br />
<br />
6. Stationary wavelet transforms to decompose ECG into sub-bands.<br />
<br />
a. SVM and KNN used to classify.<br />
<br />
b. Features used: sample entropy, normalized sub-bands, log energy entropy, median slope from sub-bands.<br />
<br />
7. Transfer learning – used deep CNN model for arrhythmia and developed it to detect heart disease.<br />
<br />
8. Simple adaptive threshold (SAT)<br />
<br />
a. Multiresolution approach, adaptive thresholding used to extract features: depth of Q peak and elevation in ST segment<br />
<br />
9. Subject-oriented approach using CNN to take in leads II, III, AVF<br />
<br />
10. Model ECG using 2nd order ODE and feed the best-fitting coefficients of the ECG signal into a SVM.<br />
<br />
11. Multi-channel CNN (16 layers) with long-short memory units<br />
<br />
12. Deep CNN that takes 3 seconds at a time of lead II as input<br />
<br />
Most use feature extraction/selection from the raw ECG data before training.<br />
<br />
o Problem with feature selection is that it is not practical for large volumes of data.<br />
<br />
Other papers that do not use feature selection arbitrarily picks ECG leads for classification and does not provide rationale.<br />
<br />
==Betty==<br />
<br />
Result: <br />
<br />
1. Quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii<br />
<br />
2. Quantification of accuracies for pairs of top 5 highest individual channels with 20-fold cross-validation, resulting highest pairs accuracies to fed into a the neural network: lead v6 and lead vz <br />
<br />
3. Use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.<br />
<br />
4. Discussing 2 factors effecting model performance evaluation:<br />
<br />
1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion <br />
<br />
2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split <br />
<br />
5. Comparing with other models, the model in this article has the highest accuracy, specificity, and precision<br />
<br />
6. Further using 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz as record-wise split <br />
<br />
1） Discuss patient-wise split might result lower accuracy evaluation, however, it still maintain an average of 97.83%<br />
<br />
== Marty ==<br />
<br />
3.1: Data curation<br />
<br />
Dataset: 549 ECG records total<br />
<br />
290 unique patients<br />
<br />
Each ECG record has a mean length of over 100s<br />
<br />
3.2: ANN model<br />
<br />
ConvNetQuake model + 1D batch normalization + Label-smoothing<br />
<br />
Model (PyTorch):<br />
<br />
- Input layer: 10-second long ECG signal<br />
<br />
- Hidden layers: 8 * (1D convolution layer, Activation function: RELU, 1D batch normalization layer)<br />
<br />
- Output layer: 1280 dimensions -> 1 dimension, Activation function: Sigmoid<br />
<br />
Batch size = 10<br />
<br />
Learning rate = 10^-4<br />
<br />
Optimizer = ADAM<br />
<br />
80-10-10: Train-Validation-Test</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=42825Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-10-26T00:22:55Z<p>Z42qin: /* Maggie */</p>
<hr />
<div><br />
== Maggie ==<br />
<br />
Introduction<br />
<br />
Problem: To detect risk of heart disease from ECG signals.<br />
<br />
Importance: To provide a correct diagnosis so that proper health care can be given to patients.<br />
<br />
Area: Deep learning – scientific machine learning<br />
<br />
o Concerned about design, training, and use of ML algorithm in an optimal matter towards a certain problem.<br />
<br />
Deep learning model benefits:<br />
<br />
o Uses multiple GPUs to construct complicated NN that can be trained using TB-sized datasets, which are robust against noise.<br />
<br />
o After training is complete, it takes little computational power to conduct statistical inference.<br />
<br />
Question that need to be answered:<br />
<br />
o What AI architecture and datasets can provide evidence of symptoms, ECG changes, imaging evidence that shows loss of viable myocardium and wall motion abnormality?<br />
<br />
Purpose of the article:<br />
<br />
o To provide a detailed analysis on the contribution of each ECG lead in identifying heart disease in the model. This is because the selection of data in previous studies regarding heart disease made data selection arbitrary in identifying heart conditions.<br />
<br />
o To show the use of using multiple data channels of information to enhance prediction accuracy in deep learning, i.e. processing the top three ECG leads simultaneously in the neural network.<br />
<br />
o Show that feature engineering is not necessary in the training, validation, or testing process for ECG data in neural networks.<br />
<br />
Related work<br />
<br />
Database used: PTB database<br />
<br />
1. CNN Network<br />
<br />
a. Used both noisy and denoised ECGs without feature engineering.<br />
<br />
2. Used artificial neural network, probabilistic neural network, KNN, multi-layer perceptron, and Naïve Bayes Classification<br />
<br />
a. Extracted two features: T-wave integral and total integral to identify heart disease.<br />
<br />
3. Developed two different ANN: RBF and MLP<br />
<br />
4. Supervised learning techniques have limited success to the problem and used multiple instance learning.<br />
<br />
a. Demonstrate proposed algorithm LTMIL surpasses supervised approaches.<br />
<br />
5. Create new feature by approximating ECG signal using a 20th order polynomial, which achieved 94.4% accuracy.<br />
<br />
6. Stationary wavelet transforms to decompose ECG into sub-bands.<br />
<br />
a. SVM and KNN used to classify.<br />
<br />
b. Features used: sample entropy, normalized sub-bands, log energy entropy, median slope from sub-bands.<br />
<br />
7. Transfer learning – used deep CNN model for arrhythmia and developed it to detect heart disease.<br />
<br />
8. Simple adaptive threshold (SAT)<br />
<br />
a. Multiresolution approach, adaptive thresholding used to extract features: depth of Q peak and elevation in ST segment<br />
<br />
9. Subject-oriented approach using CNN to take in leads II, III, AVF<br />
<br />
10. Model ECG using 2nd order ODE and feed the best-fitting coefficients of the ECG signal into a SVM.<br />
<br />
11. Multi-channel CNN (16 layers) with long-short memory units<br />
<br />
12. Deep CNN that takes 3 seconds at a time of lead II as input<br />
<br />
Most use feature extraction/selection from the raw ECG data before training.<br />
<br />
o Problem with feature selection is that it is not practical for large volumes of data.<br />
<br />
Other papers that do not use feature selection arbitrarily picks ECG leads for classification and does not provide rationale.<br />
<br />
==Betty==<br />
<br />
Result: <br />
<br />
1. Quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii<br />
<br />
2. Quantification of accuracies for pairs of top 5 highest individual channels with 20-fold cross-validation, resulting highest pairs accuracies to fed into a the neural network: lead v6 and lead vz <br />
<br />
3. Use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.<br />
<br />
4. Discussing 2 factors effecting model performance evaluation:<br />
<br />
1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion <br />
<br />
2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split <br />
<br />
5. Comparing with other models, the model in this article has the highest accuracy, specificity, and precision<br />
<br />
6. Further using 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz as record-wise split <br />
<br />
1） Discuss patient-wise split might result lower accuracy evaluation, however, it still maintain an average of 97.83%<br />
<br />
== Marty ==<br />
<br />
3.1: Data curation<br />
<br />
Dataset: 549 ECG records total<br />
<br />
290 unique patients<br />
<br />
Each ECG record has a mean length of over 100s<br />
<br />
3.2: ANN model<br />
<br />
ConvNetQuake model + 1D batch normalization + Label-smoothing<br />
<br />
Model (PyTorch):<br />
<br />
- Input layer: 10-second long ECG signal<br />
<br />
- Hidden layers: 8 * (1D convolution layer, Activation function: RELU, 1D batch normalization layer)<br />
<br />
- Output layer: 1280 dimensions -> 1 dimension, Activation function: Sigmoid<br />
<br />
Batch size = 10<br />
<br />
Learning rate = 10^-4<br />
<br />
Optimizer = ADAM<br />
<br />
80-10-10: Train-Validation-Test</div>Z42qinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=42824Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-10-26T00:22:46Z<p>Z42qin: /* Maggie */</p>
<hr />
<div><br />
== Maggie ==<br />
<br />
Introduction<br />
<br />
Problem: To detect risk of heart disease from ECG signals.<br />
<br />
Importance: To provide a correct diagnosis so that proper health care can be given to patients.<br />
<br />
Area: Deep learning – scientific machine learning<br />
<br />
o Concerned about design, training, and use of ML algorithm in an optimal matter towards a certain problem.<br />
<br />
Deep learning model benefits:<br />
<br />
o Uses multiple GPUs to construct complicated NN that can be trained using TB-sized datasets, which are robust against noise.<br />
<br />
o After training is complete, it takes little computational power to conduct statistical inference.<br />
<br />
Question that need to be answered:<br />
<br />
o What AI architecture and datasets can provide evidence of symptoms, ECG changes, imaging evidence that shows loss of viable myocardium and wall motion abnormality?<br />
<br />
Purpose of the article:<br />
<br />
o To provide a detailed analysis on the contribution of each ECG lead in identifying heart disease in the model. This is because the selection of data in previous studies regarding heart disease made data selection arbitrary in identifying heart conditions.<br />
<br />
o To show the use of using multiple data channels of information to enhance prediction accuracy in deep learning, i.e. processing the top three ECG leads simultaneously in the neural network.<br />
<br />
o Show that feature engineering is not necessary in the training, validation, or testing process for ECG data in neural networks.<br />
<br />
Related work<br />
<br />
Database used: PTB database<br />
<br />
1. CNN Network<br />
<br />
a. Used both noisy and denoised ECGs without feature engineering.<br />
<br />
2. Used artificial neural network, probabilistic neural network, KNN, multi-layer perceptron, and Naïve Bayes Classification<br />
<br />
a. Extracted two features: T-wave integral and total integral to identify heart disease.<br />
<br />
3. Developed two different ANN: RBF and MLP<br />
<br />
4. Supervised learning techniques have limited success to the problem and used multiple instance learning.<br />
<br />
a. Demonstrate proposed algorithm LTMIL surpasses supervised approaches.<br />
<br />
5. Create new feature by approximating ECG signal using a 20th order polynomial, which achieved 94.4% accuracy.<br />
<br />
6. Stationary wavelet transforms to decompose ECG into sub-bands.<br />
<br />
a. SVM and KNN used to classify.<br />
<br />
b. Features used: sample entropy, normalized sub-bands, log energy entropy, median slope from sub-bands.<br />
<br />
7. Transfer learning – used deep CNN model for arrhythmia and developed it to detect heart disease.<br />
<br />
8. Simple adaptive threshold (SAT)<br />
<br />
a. Multiresolution approach, adaptive thresholding used to extract features: depth of Q peak and elevation in ST segment<br />
<br />
9. Subject-oriented approach using CNN to take in leads II, III, AVF<br />
<br />
10. Modeled the ECG using a 2nd-order ODE and fed the best-fitting coefficients of the ECG signal into an SVM.<br />
<br />
11. Multi-channel CNN (16 layers) with long short-term memory (LSTM) units<br />
<br />
12. Deep CNN that takes 3 seconds at a time of lead II as input<br />
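The polynomial-approximation feature from item 5 can be sketched in a few lines. This is a minimal illustration on a synthetic beat, not the cited paper's data or exact pipeline; a Chebyshev basis is used purely for numerical stability at order 20, and the signal shape and noise level are assumptions.

```python
import numpy as np

# Minimal sketch: approximate one ECG segment with a 20th-order polynomial
# and use the 21 fitted coefficients as the feature vector for a classifier.
# The synthetic "beat" below (a Gaussian bump plus noise) stands in for data.
rng = np.random.default_rng(0)
t = np.linspace(-1.0, 1.0, 200)                       # normalized time axis
beat = np.exp(-(t ** 2) / 0.05) + 0.05 * rng.standard_normal(t.size)

# Chebyshev basis is chosen here only for conditioning at degree 20;
# it represents the same degree-20 polynomial least-squares fit.
coeffs = np.polynomial.chebyshev.chebfit(t, beat, deg=20)
reconstruction = np.polynomial.chebyshev.chebval(t, coeffs)

assert coeffs.shape == (21,)                          # one feature per coefficient
```

The coefficient vector `coeffs` would then be fed to a conventional classifier in place of the raw signal.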
<br />
Most use feature extraction/selection from the raw ECG data before training.<br />
<br />
o The problem with feature selection is that it is not practical for large volumes of data.<br />
<br />
Other papers that do not use feature selection pick ECG leads for classification arbitrarily and do not provide a rationale.<br />
<br />
==Betty==<br />
<br />
Result: <br />
<br />
1. Quantified the accuracy of each single channel with 20-fold cross-validation; the channels with the highest individual accuracies were v5, v6, vx, vz, and ii<br />
<br />
2. Quantified accuracies for pairs drawn from the five best single channels with 20-fold cross-validation; the best pair to feed into the neural network was lead v6 with lead vz<br />
<br />
3. Used 100-fold cross-validation on the v6/vz channel pair, then compared outliers across the top-20, top-50, and full set of 100 models, finding that the standard deviation is non-trivial and that a few models performed very poorly.<br />
<br />
4. Discussed 2 factors affecting model performance evaluation:<br />
<br />
1) The random train-validation-test split can affect the measured performance of the model; this could be mitigated with access to a larger data set and warrants further discussion.<br />
<br />
2) Random initialization of the network weights has little effect on the performance evaluation, since average results remain high with a fixed train-validation-test split.<br />
<br />
5. Compared with other models, the model in this article has the highest accuracy, specificity, and precision.<br />
<br />
6. A further 290-fold patient-wise split yielded the same top-performing pair (v6 and vz) as the record-wise split.<br />
<br />
1) A patient-wise split may yield a lower accuracy estimate; however, the model still maintains an average accuracy of 97.83%.<br />
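The record-wise versus patient-wise distinction in result 6 matters because one patient can contribute several of the 549 records, so a record-wise split can place records from the same patient on both sides of the split. A minimal sketch with made-up record and patient IDs (the counts here are illustrative, not the PTB numbers):

```python
import random
from collections import defaultdict

# Hypothetical records: (record_id, patient_id). Several records share a
# patient, mirroring how 290 patients produce 549 records in PTB.
records = [(f"rec{i}", f"pat{i % 10}") for i in range(30)]

def patient_wise_split(records, test_frac=0.2, seed=0):
    """Assign whole patients (not individual records) to the test set,
    so no patient appears on both sides of the split."""
    by_patient = defaultdict(list)
    for rec, pat in records:
        by_patient[pat].append(rec)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, int(len(patients) * test_frac))
    test_patients = set(patients[:n_test])
    train = [r for r, p in records if p not in test_patients]
    test = [r for r, p in records if p in test_patients]
    return train, test

train, test = patient_wise_split(records)
train_patients = {p for r, p in records if r in set(train)}
test_patients = {p for r, p in records if r in set(test)}
assert train_patients.isdisjoint(test_patients)  # no patient leaks across
```

A record-wise split would instead shuffle `records` directly, which is why it tends to report higher (optimistic) accuracy than the patient-wise estimate.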
<br />
== Marty ==<br />
<br />
3.1: Data curation<br />
<br />
Dataset: 549 ECG records total<br />
<br />
290 unique patients<br />
<br />
Each ECG record has a mean length of over 100s<br />
<br />
3.2: ANN model<br />
<br />
ConvNetQuake model + 1D batch normalization + Label-smoothing<br />
<br />
Model (PyTorch):<br />
<br />
- Input layer: 10-second long ECG signal<br />
<br />
- Hidden layers: 8 * (1D convolution layer, Activation function: ReLU, 1D batch normalization layer)<br />
<br />
- Output layer: 1280 dimensions -> 1 dimension, Activation function: Sigmoid<br />
<br />
Batch size = 10<br />
<br />
Learning rate = 10^-4<br />
<br />
Optimizer = Adam<br />
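The architecture and training configuration above can be sketched in PyTorch. The summary fixes only the 8 hidden blocks, the 1280-dimensional output layer, the sigmoid activation, and the hyperparameters; the filter count (32), kernel size (3), stride (2), and 1000 Hz sampling rate are assumptions chosen so that a 10-second input (10,000 samples) halves eight times to 40 time steps × 32 channels = 1280 features.

```python
import torch
import torch.nn as nn

# Sketch of the described model: 8 blocks of (Conv1d -> ReLU -> BatchNorm1d)
# followed by a 1280 -> 1 sigmoid head. Filter count, kernel size, stride,
# and sampling rate are assumptions, not stated in the summary.
class ECGNet(nn.Module):
    def __init__(self, n_leads=1, n_filters=32):
        super().__init__()
        blocks, in_ch = [], n_leads
        for _ in range(8):
            blocks += [
                nn.Conv1d(in_ch, n_filters, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.BatchNorm1d(n_filters),
            ]
            in_ch = n_filters
        self.features = nn.Sequential(*blocks)
        self.head = nn.Linear(1280, 1)   # 40 time steps x 32 channels = 1280

    def forward(self, x):                # x: (batch, leads, samples)
        z = self.features(x).flatten(1)
        return torch.sigmoid(self.head(z))

model = ECGNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr = 10^-4
batch = torch.randn(10, 1, 10_000)   # batch size 10, one lead, 10 s @ 1 kHz
probs = model(batch)
assert probs.shape == (10, 1)        # one MI probability per record
```

With `stride=2` and `padding=1`, the length sequence is 10000 → 5000 → 2500 → 1250 → 625 → 313 → 157 → 79 → 40, which is what makes the flattened feature vector come out to 1280.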
<br />
80-10-10: Train-Validation-Test</div>