http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=A29mukhe&feedformat=atomstatwiki - User contributions [US]2022-01-25T21:01:09ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=48422Task Understanding from Confusing Multi-task Data2020-11-30T08:51:23Z<p>A29mukhe: /* Critique */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, assist doctors to make data-driven decisions, and even self-driving cars. One of the most famous integrated forms of Narrow AI is Apple's Siri. Siri has no self-awareness or genuine intelligence, and hence often has challenges performing tasks outside its range of abilities. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The first function allocates an input to its respective task and the latter function maps the input to its label within the allocated tasks. See figure 1(b). To implement the CSL, we use two neural networks to represent the de-confusing function and mapping function respectively. However, simply combining the two functions or networks to a single architecture is impossible, since the one-hot constraint of the outputs for the de-confusing network makes the gradient back-propagation unfeasible. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization in the proposed architecture CLS-Net.<br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine assigned with complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, we know that samples are generated from multiple distinct distributions instead of one distribution combining a mixture of multiple probability models. Thus, the latent variable learning can not fully distinguish labels into different tasks and different distributions, and it is insufficient to classify the multi-task confusing samples. <br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved the overall learning efficiency, since the labels in different tasks are often correlated: improving the classfication result for one class also help with other classification tasks. In multi-task learning, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. If such manuual task annotation is abstent, then the algorithm can not be performed. <br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, let <math> (x,y)</math> be the training samples from <math>y=f(x)</math>, which is an identical but unknown mapping relationship. Assuming the risk measure is mean squared error (MSE), the expected risk function is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the data distribution of the input variable <math>x</math>. In practice, the methods select the optimal function by minimizing the empirical risk:<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
To minimize the risk function, the theoretically optimal solution is <math> f(x) </math>.<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk function can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math>. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. The solution represents a mixed probably model instead of knowing the exact tasks and their correpsonding individual probability distribution. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to the output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (using MSE loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
The risk metric of every sample affects only its assigned task.<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
However, necessity constraints are needed to avoid meaningless trivial solutions in all optimal risk solutions.<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
This theorem shows the method of empirical risk minimization is valid in the CSL framework. Moreover, the assumed number of tasks affects the VC dimension of the learning functions, which is positively related to the generalization error. Therefore, to make the training risk small, we need to choose the ''minimum number'' of tasks when determining the task.<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to reproduce both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of digit data in a range of 0 to 9, each of which is in a single color among the eight different colors. Each observation in this modified set consists of a colored image (<math>x_i</math>) and a label (<math>y_i</math>) that represents either the corresponding color, or the digit. The goal is to reproduce the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': The second image classification data set consists of several fashion-related objects labeled from any of the 3 criteria: “gender”, “category”, and “main color”, whose number of observations is larger than that of the "colored-MNIST" data set.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks. The CSL methods autonomously learned three tasks which corresponded exactly to “Gender”,<br />
“Category”, and “Color” as we see it.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
The purpose of this measure arises from the fact that, in addition to learning mapping allocations like humans, machines should be able to approximate all mapping functions accurately in order to provide corresponding labels. The Label Prediction Accuracy measure captures the exchange equivalence of the following task: each mapping contains its ground-truth output, and machines should be predicting the correct output that is close to the ground-truth. <br />
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': To "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well in learning the ground-truth. That means the initialization of the neural network is set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labelling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.<br />
<br />
Applications of multi-label classification include building a recommendation system, social media targeting, as well as detecting adverse drug reactions from the text.<br />
<br />
Multi-label can be used to improve the syndrome diagnosis of a patient by focusing on multiple syndromes instead of a single syndrome.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem without manual task annotations from basic input data. The model obtains a basic task concept by learning the minimum risk for confusing samples from differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, some limitations can be improved for future work:<br />
<br />
- The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process; <br />
<br />
- The current algorithm is difficult to learn basic features directly through a CNN structure and understand tasks simultaneously by training a full-connect network. However, this limitation does not affect the effectiveness of our algorithm in learning confusing data based on pre-trained features.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed to converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented in more examples, to test the robustness of different types of data. The validation tasks chosen by data are all very simple, and CSL is actually not necessary for those tasks. For the colored MNIST data, a simple function can be written to distinguish the color label from the number label. The same problem applied to the Kaggle Fashion product dataset. The candidate label can be easily classified into different tasks by some wording analysis or meaning classification program or even manual classification. Even though the idea discussed by authors are interesting, the examples suggested by authors seem to suggest very limited or even unnessary application. In most cases, it is more benefitial to treat the Confusing Multi-task Data problems separately into two distinct stages: we classify the tasks first according to the meaning of the label, and then we perform a multi-class/multi-label training process.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
When using this framework for classification, the order of the one-hot classification labels for each task will likely influence the relationships learned between each task, since the same output header is used for all tasks. This may be why this method fails to learn low-level representations and requires pretraining. I would like to see more explanation in the paper about why this isn't a problem if it was investigated.<br />
<br />
It would be a good idea to include comparison details in the summary to make the results and the conclusion more convincing. For instance, though the paper introduced the result generated using confusion data, and provide some applications for multi-label learning, these two sections still fell short and could use some technical details as supporting evidence.<br />
<br />
It is interesting to investigate if the order of adding tasks will influence the model performance.<br />
<br />
It would be interesting to see the effectiveness of applying CSL in face recognition, such that not only does the algorithm map the face to identity, it also categorizes the face based on other features like beard/no beard and glasses/no glasses simultaneously.<br />
<br />
For pattern recognition,pre-trained features were used in the algorithm. It would be interesting to see how the effectiveness of the model changes if we train it with data directly from the CNN structure in the future.<br />
<br />
So basically given a confused dataset CSL finds the important tasks or labels from the dataset as can be seen from the fruit example. In the example, fruits are grouped under their names, their tastes, and their color, when CSL is given a mixed dataset. Hence given an unstructured data, unlabeled, confused dataset CSL helps in finding the labels, which in turn can help in cleaning the dataset and further in preparing high-quality training data set which is very important in different ML algorithms. Since at present preparing these dataset requires manual data annotations, CSL can save time in that process.<br />
<br />
For the Colorful-Mnist data set, the goal is to understand the concept of multiple classification tasks from these examples. All inputs have multiple classification tasks. Each observed sample only represents the classification result of one task, and the task from which the sample comes is unknown.<br />
<br />
It would be nice to know why the given metrics of confusing supervised learning are used. The authors should have used several different metrics and show that CSL's overall performs better than other methods. And what are "the other methods" referring to?<br />
<br />
In the Iterative Deconfusing Algorithmn section, the Training of Mapping-Net needs more explanation. The authors should specify what it is doing before showing its equations.<br />
<br />
For the results section, it would be more intuitive and stronger if the author provide more detail on these two methods and add a plot to support the claim. Based on the text, it might not be an obvious comparison.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.<br />
<br />
[6] Guo-Ping Liu, Jian-Jun Yan, Yi-Qin Wang, Jing-Jing Fu, Zhao-Xia Xu, Rui Guo, Peng Qian, "Application of Multilabel Learning Using the Relevant Feature for Each Label in Chronic Gastritis Syndrome Diagnosis", Evidence-Based Complementary and Alternative Medicine, vol. 2012, Article ID 135387, 9 pages, 2012. https://doi.org/10.1155/2012/135387</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Efficient_kNN_Classification_with_Different_Numbers_of_Nearest_Neighbors&diff=48419Efficient kNN Classification with Different Numbers of Nearest Neighbors2020-11-30T08:42:06Z<p>A29mukhe: /* Train a Decision Tree using k as the label */</p>
<hr />
<div>== Presented by == <br />
Cooper Brooke, Daniel Fagan, Maya Perelman<br />
<br />
== Introduction == <br />
Unlike popular parametric approaches, which first utilizes training observations to learn a model and subsequently predict test samples with this model, the non-parametric k-nearest neighbors (kNNs) method classifies observations with a majority rules approach. kNN classifies test samples by assigning them the label of their k closest training observation (neighbours). This method has become very popular due to its significant performance and easy implementation. <br />
<br />
There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples. The second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.<br />
<br />
== Previous Work == <br />
<br />
Previous work on finding an optimal fixed k value for all test samples is well-studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al.[2] explored involved selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] had the solution of selecting a robust k using the consensus of multiple rounds of kNNs. These fixed k methods are valuable however are impractical for data mining and machine learning applications. <br />
<br />
Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahugara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross validation while Sahugara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches such as those taken by Zheng et al. and Góra and Wojna also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.<br />
<br />
== Motivation == <br />
<br />
Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seek to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.<br />
<br />
A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.<br />
<br />
== Approach == <br />
<br />
<br />
=== kTree Classification ===<br />
<br />
The proposed kTree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_1.png | center | 800x800px]]<br />
<br />
==== Reconstruction ====<br />
<br />
The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}</math> represents the training set can be written as:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2<br />
\end{aligned}$$<br />
<br />
In addition, an <math>l_1</math> regularization term multiplied by a tuning parameter, <math>\rho_1</math>, is added to ensure that sparse results are generated as the objective is to minimize the number of training samples that will eventually be depended on by the test samples. <br />
<br />
The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. After some transformations, this second regularization term that has tuning parameter <math>\rho_2</math> is:<br />
<br />
$$\begin{aligned}<br />
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})<br />
\end{aligned}$$<br />
<br />
where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features.<br />
<br />
This gives a final objective function of:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}|| + \rho_2R(\textbf{W})<br />
\end{aligned}$$<br />
<br />
Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.<br />
<br />
==== Calculate ''k'' for training set ====<br />
<br />
Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training sample so if a value is 0, it can be concluded that the jth training sample has no effect on the ith training sample which means that it should not be used in the prediction of the ith training sample. Consequently, all non-zero values in the <math>w_{.j}</math> vector would be useful in predicting the ith training sample which gives the result that the number of these non-zero elements for each sample is equal to the optimal ''k'' value for each sample.<br />
<br />
For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:<br />
<br />
[[File:Approach_Figure_2.png | center | 300x300px]]<br />
<br />
The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 is non-zero.<br />
<br />
==== Train a Decision Tree using ''k'' as the label ====<br />
<br />
In a normal decision tree, the target data is the labels themselves. In contrast, in the kTree method, the target data is the optimal ''k'' values for each sample that were solved for in the previous step. So this decision tree has the following form:<br />
<br />
[[File:Approach_Figure_3.png | center | 300x300px]]<br />
<br />
==== Making Predictions for Test Data ====<br />
<br />
The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbours across '''all''' of the training data.<br />
<br />
=== k*Tree Classification ===<br />
<br />
The proposed k*Tree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_4.png | center | 800x800px]]<br />
<br />
Clearly, this is a very similar approach to the kTree as the k*Tree method attempts to sacrifice very little in predictive power in return for a substantial decrease in complexity when actually implementing the traditional kNN on the testing data once the optimal ''k'' values have been found.<br />
<br />
While all steps previous are the exact same, the k*Tree method not only stores the optimal ''k'' value but also the following information:<br />
<br />
* The training samples that have the same optimal ''k''<br />
* The ''k'' nearest neighbours of the previously identified training samples<br />
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours<br />
<br />
The data stored in each node is summarized in the following figure:<br />
<br />
[[File:Approach_Figure_5.png | center | 800x800px]]<br />
<br />
In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.<br />
<br />
== Experiments == <br />
<br />
In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, and contained a mix of data types varying in size, in dimensionality, in the number of classes, and in imbalanced nature of the data. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:<br />
<br />
# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.<br />
# kNN-Based Applicability Domain Approach (AD-kNN) [11]<br />
# kNN Method Based on Sparse Learning (S-kNN) [10]<br />
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]<br />
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]<br />
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]<br />
<br />
The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features. <br />
<br />
<br />
'''A. Experimental Results on Different Sample Sizes'''<br />
<br />
The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.<br />
<br />
[[File:Table_I_kNN.png | center | 800x800px]]<br />
<br />
The following key results are noted:<br />
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-KNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.<br />
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.<br />
* The proposed k*Tree method had the lowest running cost of all methods. However, the k*Tree method was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method.<br />
<br />
<br />
'''B. Experimental Results on Different Feature Numbers'''<br />
<br />
The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score [15] approach was used to rank and select the most information features in the datasets. <br />
<br />
[[File:Table_II_kNN.png | center | 800x800px]]<br />
<br />
From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-KNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.<br />
<br />
== Conclusion == <br />
<br />
This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods achieve efficient classification by designing a training step that reduces the run time of the test stage. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features. <br />
<br />
== Critiques == <br />
<br />
*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy. <br />
* The authors addressed that some of the UCI datasets contained imbalance data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.). <br />
*While the authors contrast their ktTee and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compared to Góra and Wojna varied optimal k method.<br />
<br />
* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.<br />
<br />
* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.<br />
<br />
* It would be really helpful if Ktrees method can be explained at the very beginning. The transition from KNN to Ktrees are not very smooth.<br />
<br />
== References == <br />
<br />
[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.<br />
<br />
[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.<br />
<br />
[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.<br />
<br />
[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.<br />
<br />
[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.<br />
<br />
[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.<br />
<br />
[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.<br />
<br />
[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.<br />
<br />
[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.<br />
<br />
[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.<br />
<br />
[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013. <br />
<br />
[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.<br />
<br />
[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.<br />
<br />
[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.<br />
<br />
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Efficient_kNN_Classification_with_Different_Numbers_of_Nearest_Neighbors&diff=48418Efficient kNN Classification with Different Numbers of Nearest Neighbors2020-11-30T08:41:41Z<p>A29mukhe: /* Train a Decision Tree using k as the label */</p>
<hr />
<div>== Presented by == <br />
Cooper Brooke, Daniel Fagan, Maya Perelman<br />
<br />
== Introduction == <br />
Unlike popular parametric approaches, which first utilizes training observations to learn a model and subsequently predict test samples with this model, the non-parametric k-nearest neighbors (kNNs) method classifies observations with a majority rules approach. kNN classifies test samples by assigning them the label of their k closest training observation (neighbours). This method has become very popular due to its significant performance and easy implementation. <br />
<br />
There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples. The second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.<br />
<br />
== Previous Work == <br />
<br />
Previous work on finding an optimal fixed k value for all test samples is well-studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al.[2] explored involved selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] had the solution of selecting a robust k using the consensus of multiple rounds of kNNs. These fixed k methods are valuable however are impractical for data mining and machine learning applications. <br />
<br />
Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahugara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross validation while Sahugara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches such as those taken by Zheng et al. and Góra and Wojna also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.<br />
<br />
== Motivation == <br />
<br />
Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seek to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.<br />
<br />
A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.<br />
<br />
== Approach == <br />
<br />
<br />
=== kTree Classification ===<br />
<br />
The proposed kTree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_1.png | center | 800x800px]]<br />
<br />
==== Reconstruction ====<br />
<br />
The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}</math> represents the training set can be written as:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2<br />
\end{aligned}$$<br />
<br />
In addition, an <math>l_1</math> regularization term multiplied by a tuning parameter, <math>\rho_1</math>, is added to ensure that sparse results are generated as the objective is to minimize the number of training samples that will eventually be depended on by the test samples. <br />
<br />
The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. After some transformations, this second regularization term that has tuning parameter <math>\rho_2</math> is:<br />
<br />
$$\begin{aligned}<br />
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})<br />
\end{aligned}$$<br />
<br />
where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features.<br />
<br />
This gives a final objective function of:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}|| + \rho_2R(\textbf{W})<br />
\end{aligned}$$<br />
<br />
Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.<br />
<br />
==== Calculate ''k'' for training set ====<br />
<br />
Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training sample so if a value is 0, it can be concluded that the jth training sample has no effect on the ith training sample which means that it should not be used in the prediction of the ith training sample. Consequently, all non-zero values in the <math>w_{.j}</math> vector would be useful in predicting the ith training sample which gives the result that the number of these non-zero elements for each sample is equal to the optimal ''k'' value for each sample.<br />
<br />
For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:<br />
<br />
[[File:Approach_Figure_2.png | center | 300x300px]]<br />
<br />
The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 is non-zero.<br />
<br />
==== Train a Decision Tree using ''k'' as the label ====<br />
<br />
In a normal decision tree, the target data is the labels themselves. In the kTree method, the target data is the optimal ''k'' values for each sample that were solved for in the previous step. So this decision tree has the following form:<br />
<br />
[[File:Approach_Figure_3.png | center | 300x300px]]<br />
<br />
==== Making Predictions for Test Data ====<br />
<br />
The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbours across '''all''' of the training data.<br />
<br />
=== k*Tree Classification ===<br />
<br />
The proposed k*Tree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_4.png | center | 800x800px]]<br />
<br />
Clearly, this is a very similar approach to the kTree as the k*Tree method attempts to sacrifice very little in predictive power in return for a substantial decrease in complexity when actually implementing the traditional kNN on the testing data once the optimal ''k'' values have been found.<br />
<br />
While all steps previous are the exact same, the k*Tree method not only stores the optimal ''k'' value but also the following information:<br />
<br />
* The training samples that have the same optimal ''k''<br />
* The ''k'' nearest neighbours of the previously identified training samples<br />
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours<br />
<br />
The data stored in each node is summarized in the following figure:<br />
<br />
[[File:Approach_Figure_5.png | center | 800x800px]]<br />
<br />
In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.<br />
<br />
== Experiments == <br />
<br />
In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, and contained a mix of data types varying in size, in dimensionality, in the number of classes, and in imbalanced nature of the data. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:<br />
<br />
# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.<br />
# kNN-Based Applicability Domain Approach (AD-kNN) [11]<br />
# kNN Method Based on Sparse Learning (S-kNN) [10]<br />
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]<br />
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]<br />
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]<br />
<br />
The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features. <br />
<br />
<br />
'''A. Experimental Results on Different Sample Sizes'''<br />
<br />
The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.<br />
<br />
[[File:Table_I_kNN.png | center | 800x800px]]<br />
<br />
The following key results are noted:<br />
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-KNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.<br />
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.<br />
* The proposed k*Tree method had the lowest running cost of all methods. However, the k*Tree method was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method.<br />
<br />
<br />
'''B. Experimental Results on Different Feature Numbers'''<br />
<br />
The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score [15] approach was used to rank and select the most information features in the datasets. <br />
<br />
[[File:Table_II_kNN.png | center | 800x800px]]<br />
<br />
From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-KNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.<br />
<br />
== Conclusion == <br />
<br />
This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods achieve efficient classification by designing a training step that reduces the run time of the test stage. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features. <br />
<br />
== Critiques == <br />
<br />
*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy. <br />
* The authors addressed that some of the UCI datasets contained imbalance data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.). <br />
*While the authors contrast their ktTee and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compared to Góra and Wojna varied optimal k method.<br />
<br />
* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.<br />
<br />
* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.<br />
<br />
* It would be really helpful if Ktrees method can be explained at the very beginning. The transition from KNN to Ktrees are not very smooth.<br />
<br />
== References == <br />
<br />
[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.<br />
<br />
[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.<br />
<br />
[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.<br />
<br />
[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.<br />
<br />
[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.<br />
<br />
[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.<br />
<br />
[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.<br />
<br />
[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.<br />
<br />
[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.<br />
<br />
[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.<br />
<br />
[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013. <br />
<br />
[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.<br />
<br />
[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.<br />
<br />
[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.<br />
<br />
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction&diff=48417Surround Vehicle Motion Prediction2020-11-30T08:39:07Z<p>A29mukhe: /* Data */</p>
<hr />
<div>DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections was described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.<br />
<br />
== Previous Work ==<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.<br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection. The following image gives a visual representation of the model.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Data ===<br />
Multi-lane turn intersections are the target roads in this paper. The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.<br />
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading and speed.<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit. <br />
==== Sequence length ==== <br />
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.<br />
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles. <br />
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\<br />
&||\mu(k+1|t) - \mu(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
Determine the position and speed boundary based on the predicted state:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.<br />
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms, a path-following model with <br />
constant velocity, a path-following model with traffic flow and a CTRV model.<br />
<br />
We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>, <br />
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
<br />
The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean, <br />
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped <br />
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers' <br />
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within <br />
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, <br />
the proposed algorithm can be precise and maintain safety simultaneously.<br />
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method mimics a human driver better, by simulating a human driver's decision-making process. <br />
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target <br />
vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to <br />
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based <br />
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when <br />
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means <br />
that these cases took place sufficiently beyond the safety distance, and had little influence on determining <br />
the behaviour of the subject vehicle.<br />
<br />
In order to compare the similarities between the results form the proposed algorithm and human driving decisions, <br />
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math><br />
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm, <br />
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base <br />
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm <br />
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed <br />
model is efficient and safe.<br />
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm.<br />
<br />
== Future works ==<br />
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.<br />
<br />
3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.<br />
<br />
This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.<br />
<br />
Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.<br />
<br />
There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.<br />
<br />
It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
It would be really helpful if some visualization or data summary can be provided to understand the content, such as the track of the car movement.<br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6]Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Speech2Face:_Learning_the_Face_Behind_a_Voice&diff=48416Speech2Face: Learning the Face Behind a Voice2020-11-30T08:33:05Z<p>A29mukhe: /* Discussion and Critiques */</p>
<hr />
<div>== Presented by == <br />
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang<br />
<br />
== Introduction ==<br />
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.<br />
<br />
== Previous Work ==<br />
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.<br />
<br />
Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.<br />
<br />
== Motivation ==<br />
It seems to be a common trait among humans to imagine what some people look like when we hear their voices before we have seen what they look lke. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.<br />
<br />
== Model Architecture == <br />
<br />
'''Speech2Face model and training pipeline'''<br />
<br />
[[File:ModelFramework.jpg]]<br />
<br />
Figure 1. '''Speech2Face model and training pipeline''' <br />
<br />
<br />
<br />
The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database [5] is used to extract a 4069-D face feature from the penultimate layer of the network. <br />
<br />
'''Voice Encoder Architecture''' <br />
<br />
[[File:VoiceEncoderArch.JPG]]<br />
<br />
Table 1: '''Voice encoder architecture'''<br />
<br />
<br />
<br />
The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time and allows the model to be used for input speeches of varying lengths. Two fully connected layers at the end are used to return a 4096-dimensional facial feature output.<br />
<br />
'''Training'''<br />
<br />
The AVSSpeech dataset, a large-scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair color, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.<br />
<br />
The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and then inputted to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096-dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.<br />
<br />
In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$<br />
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.<br />
<br />
<center><br />
[[File:L1vsTotalLoss.png | 700px]]<br />
</center><br />
<br />
Figure 2: '''Qualitative results on the AVSpeech test set'''<br />
<br />
== Results ==<br />
<br />
'''Confusion Matrix and Dataset statistics'''<br />
<br />
<center><br />
[[File:Confusionmatrix.png| 600px]]<br />
</center><br />
<br />
Figure 3. '''Facial attribute evaluation''' <br />
<br />
<br />
<br />
In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face. <br />
<br />
'''Feature Similarity'''<br />
<br />
<center><br />
[[File:FeatSim.JPG]]<br />
</center><br />
<br />
Table 2. '''Feature similarity'''<br />
<br />
<br />
<br />
Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and the true facial feature vector from the face decoder were computed, and presented, above, in Table 2. A comparison of facial similarity was also done based on the length of audio input. From the table, it is evident that the 6-second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth. <br />
<br />
'''S2f -> Face retrieval performance'''<br />
<br />
<center><br />
[[File: Retrieval.JPG]]<br />
</center><br />
<br />
Table 3. '''S2F -> Face retrieval performance'''<br />
<br />
<br />
<br />
The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.<br />
<br />
== Conclusion ==<br />
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.<br />
<br />
== Discussion and Critiques ==<br />
<br />
There is evidence that the results of the model may be heavily influenced by external factors:<br />
<br />
1. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the model's prediction of ethnicity towards white. The bias in the results shows that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. <br />
<br />
2. The model was shown to infer different face features based on language. This puts into question how heavily the model depends on the spoken language. The paper mentioned the quality of face reconstruction may be affected by uncommon languages, where English is the most popular language on Youtube(training set). Testing a more controlled sample where all speech recording was of the same language may help address this concern to determine the model's reliance on spoken language.<br />
<br />
3. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender, and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face and may result in a compounding effect on the miss-classification rate. <br />
<br />
4. Figure 4.b shows the AVSpeech dataset statistics. However, it doesn't show the statistics about speakers' ethnicity and the language of the video. If we train the model with a more comprehensive dataset that includes enough Asian/Indian English speakers and native language speakers will this increase the accuracy?<br />
<br />
5. One concern about the source of the training data, i.e. the Youtube videos, is that resolution varies a lot since the videos are randomly selected. That may be the reason why the proposed model performs badly on some certain features. For example, it is hard to tell the age when the resolution is bad because the wrinkles on the face are neglected.<br />
<br />
6. The topic of this project is very interesting, but I highly doubt this model will be practical in real world problems. Because there are many factors to affect a person's sound in the real world environment. Sounds such like phone clock, TV, car horn and so on. These sounds will decrease the accuracy of the predicted result of the model.<br />
<br />
== References ==<br />
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In<br />
IEEE International Conference on Computer Vision (ICCV),<br />
2017.<br />
<br />
[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International<br />
Conference on Acoustics, Speech and Signal Processing<br />
(ICASSP), 2019.<br />
<br />
[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.<br />
<br />
[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.<br />
<br />
[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectors&diff=48415Improving neural networks by preventing co-adaption of feature detectors2020-11-30T08:28:45Z<p>A29mukhe: /* Critiques */</p>
<hr />
<div>== Presented by ==<br />
Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim<br />
<br />
= Introduction =<br />
In this paper, Hinton et al. introduces a novel way to improve neural networks’ performance. By omitting neurons in hidden layers with a probability of 0.5, each hidden unit is prevented from relying on other hidden units being present during training. Hence there are fewer co-adaptations among them on the training data. Called “dropout,” this process is also an efficient alternative to training many separate networks and average their predictions on the test set.<br />
They used the standard, stochastic gradient descent algorithm and separated training data into mini-batches. An upper bound was set on the L2 norm of incoming weight vector for each hidden neuron, which was normalized if its size exceeds the bound. They found that using a constraint, instead of a penalty, forced model to do a more thorough search of the weight-space, when coupled with the very large learning rate that decays during training. <br />
Their dropout models included all of the hidden neurons, and their outgoing weights were halved to account for the chances of omission. The models were shown to result in lower test error rates on several datasets: MNIST; TIMIT; Reuters Corpus Volume; CIFAR-10; and ImageNet.<br />
<br />
= MNIST =<br />
The MNIST dataset contains 70,000 digit images of size 28 x 28. To see the impact of dropout, they used 4 different neural networks (784-800-800-10, 784-1200-1200-10, 784-2000-2000-10, 784-1200-1200-1200-10), using the same dropout rates as 50% for hidden neurons and 20% for visible neurons. Stochastic gradient descent was used with mini-batches of size 100 and a cross-entropy objective function as the loss function. Weights were updated after each minibatch, and training was done for 3000 epochs. An exponentially decaying learning rate <math>\epsilon</math> was used, with the initial value set as 10.0, and it was multiplied by 0.998 at the end of each epoch. At each hidden layer, the incoming weight vector for each hidden neuron was set an upper bound of its length, <math>l</math>, and they found from cross-validation that the results were the best when <math>l</math> = 15. Initial weights values were pooled from a normal distribution with mean 0 and standard deviation of 0.01. To update weights, an additional variable, ''p'', called momentum, was used to accelerate learning. The initial value of <math>p</math> was 0.5, and it increased linearly to the final value 0.99 during the first 500 epochs, remaining unchanged after. Also, when updating weights, the learning rate was multiplied by <math>1 – p</math>. <math>L</math> denotes the gradient of loss function.<br />
<br />
[[File:weights_mnist2.png|center|400px]]<br />
<br />
The best published result for a standard feedforward neural network was 160 errors, and it was reduced to about 130 errors with dropout. By omitting a random 20% of the input pixels, it was further reduced to 110 errors. The following figure visualizes the result.<br />
[[File:mnist_figure.png|center|500px]]<br />
A publicly available pre-trained deep belief net resulted in 118 errors, and it was reduced to 92 errors when the model was fine-tuned with dropout. Another publicly available model was a deep Boltzmann machine, and it resulted in 103, 97, 94, 93 and 88 when the model was fine-tuned using standard backpropagation and was unrolled. They were reduced to 83, 79, 78, 78, and 77 when the model was fine-tuned with dropout – the mean of 79 errors was a record for models that do not use prior knowledge or enhanced training sets.<br />
<br />
= TIMIT = <br />
<br />
TIMIT dataset includes voice samples of 630 American English speakers varying across 8 different dialects. It is often used to evaluate the performance of automatic speech recognition systems. Using Kaldi, the dataset was pre-processed to extract input features in the form of log filter bank responses.<br />
<br />
=== Pre-training and Training ===<br />
<br />
For pretraining, they pretrained their neural network with a deep belief network and the first layer was built using Restricted Boltzmann Machine (RBM). Initializing visible biases with zero, weights were sampled from random numbers that followed normal distribution <math>N(0, 0.01)</math>. Each visible neuron’s variance was set to 1.0 and remained unchanged.<br />
<br />
Minimizing Contrastive Divergence (CD) was used to facilitate learning. Since momentum is used to speed up learning, it was initially set to 0.5 and increased linearly to 0.9 over 20 epochs. The average gradient had 0.001 of a learning rate which was then multiplied by <math>(1-momentum)</math> and L2 weight decay was set to 0.001. After setting up the hyperparameters, the model was done training after 100 epochs. Binary RBMs were used for training all subsequent layers with a learning rate of 0.01. Then, <math>p</math> was set as the mean activation of a neuron in the data set and the visible bias of each neuron was initialized to <math>log(p/(1 − p))</math>. Training each layer with 50 epochs, all remaining hyper-parameters were the same as those for the Gaussian RBM.<br />
<br />
=== Dropout tuning ===<br />
<br />
The initial weights were set in a neural network from the pretrained RBMs. To finetune the network with dropout-backpropagation, momentum was initially set to 0.5 and increased linearly up to 0.9 over 10 epochs. The model had a small constant learning rate of 1.0 and it was used to apply to the average gradient on a minibatch. The model also retained all other hyperparameters the same as the model from MNIST dropout finetuning. The model required approximately 200 epochs to converge. For comparison purpose, they also finetuned the same network with standard backpropagation with a learning rate of 0.1 with the same hyperparameters.<br />
<br />
=== Classification Test and Performance ===<br />
<br />
A Neural network was constructed to output the classification error rate on the test set of TIMIT dataset. They have built the neural network with four fully-connected hidden layers with 4000 neurons per layer. The output layer distinguishes distinct classes from 185 softmax output neurons that are merged into 39 classes. After constructing the neural network, 21 adjacent frames with an advance of 10ms per frame was given as an input.<br />
<br />
Comparing the performance of dropout with standard backpropagation on several network architectures and input representations, dropout consistently achieved lower error and cross-entropy. Results showed that it significantly controls overfitting, making the method robust to choices of network architecture. It also allowed much larger nets to be trained and removed the need for early stopping. Thus, neural network architectures with dropout are not very sensitive to the choice of learning rate and momentum.<br />
<br />
= Reuters Corpus Volume =<br />
Reuters Corpus Volume I archives 804,414 news documents that belong to 103 topics. Under four major themes - corporate/industrial, economics, government/social, and markets – they belonged to 63 classes. After removing 11 classes with no data and one class with insufficient data, they are left with 50 classes and 402,738 documents. The documents were divided into training and test sets equally and randomly, with each document representing the 2000 most frequent words in the dataset, excluding stopwords.<br />
<br />
They trained two neural networks, with size 2000-2000-1000-50, one using dropout and backpropagation, and the other using standard backpropagation. The training hyperparameters are the same as that in MNIST, but training was done for 500 epochs.<br />
<br />
In the following figure, we see the significant improvements by the model with dropout in the test set error. On the right side, we see that the learning with dropout also proceeds smoother. <br />
<br />
[[File:reuters_figure.png|700px|center]]<br />
<br />
= CNN =<br />
<br />
Feed-forward neural networks consist of several layers of neurons where each neuron in a layer applies a linear filter to the input image data and is passed on to the neurons in the next layer. When calculating the neuron’s output, scalar bias aka weights is applied to the filter with nonlinear activation function as parameters of the network that are learned by training data. [[File:cnnbigpicture.jpeg|thumb|upright=2|center|alt=text|Figure: Overview of Convolutional Neural Network]] There are several differences between Convolutional Neural networks and ordinary neural networks. The figure above gives a visual representation of a Convolutional Neural Network. First, CNN’s neurons are organized topographically into a bank and laid out on a 2D grid, so it reflects the organization of dimensions of the input data. Secondly, neurons in CNN apply filters which are local, and which are centered at the neuron’s location in the topographic organization. Meaning that useful metrics or clues to identify the object in an input image which can be found by examining local neighborhoods of the image. Next, all neurons in a bank apply the same filter at different locations in the input image. When looking at the image example, green is an input to one neuron bank, yellow is filter bank, and pink is the output of one neuron bank (convolved feature). A bank of neurons in a CNN applies a convolution operation, aka filters, to its input where a single layer in a CNN typically has multiple banks of neurons, each performing a convolution with a different filter. The resulting neuron banks become distinct input channels into the next layer. The whole process reduces the net’s representational capacity, but also reduces the capacity to overfit.<br />
[[File:bankofneurons.gif|thumb|upright=3|center|alt=text|Figure: Bank of neurons]]<br />
<br />
=== Pooling ===<br />
<br />
Pooling layer summarizes the activities of local patches of neurons in the convolutional layer by subsampling the output of a convolutional layer. Pooling is useful for extracting dominant features, to decrease the computational power required to process the data through dimensionality reduction. The procedure of pooling goes on like this; output from convolutional layers is divided into sections called pooling units and they are laid out topographically, connected to a local neighborhood of other pooling units from the same convolutional output. Then, each pooling unit is computed with some function which could be maximum and average. Maximum pooling returns the maximum value from the section of the image covered by the pooling unit while average pooling returns the average of all the values inside the pooling unit (see example). In result, there are fewer total pooling units than convolutional unit outputs from the previous layer, this is due to larger spacing between pixels on pooling layers. Using the max-pooling function reduces the effect of outliers and improves generalization.<br />
[[File:maxandavgpooling.jpeg|thumb|upright=2|center|alt=text|Figure: Max pooling and Average pooling]]<br />
<br />
=== Local Response Normalization === <br />
<br />
This network includes local response normalization layers which are implemented in lateral form and used on neurons with unbounded activations and permits the detection of high-frequency features with a big neuron response. This regularizer encourages competition among neurons belonging to different banks. Normalization is done by dividing the activity of a neuron in bank <math>i</math> at position <math>(x,y)</math> by the equation:<br />
[[File:local response norm.png|upright=2|center|]] where the sum runs over <math>N</math> ‘adjacent’ banks of neurons at the same position as in the topographic organization of neuron bank. The constants, <math>N</math>, <math>alpha</math> and <math>betas</math> are hyper-parameters whose values are determined using a validation set. This technique is replaced by better techniques such as the combination of dropout and regularization methods (<math>L1</math> and <math>L2</math>)<br />
<br />
=== Neuron nonlinearities ===<br />
<br />
All of the neurons for this model use the max-with-zero nonlinearity where output within a neuron is computed as <math> a^{i}_{x,y} = max(0, z^i_{x,y})</math> where <math> z^i_{x,y} </math> is the total input to the neuron. The reason they use nonlinearity is because it has several advantages over traditional saturating neuron models, such as significant reduction in training time required to reach a certain error rate. Another advantage is that nonlinearity reduces the need for contrast-normalization and data pre-processing since neurons do not saturate- meaning activities simply scale up little by little with usually large input values. For this model’s only pre-processing step, they subtract the mean activity from each pixel and the result is a centered data.<br />
<br />
=== Objective function ===<br />
<br />
The objective function of their network maximizes the multinomial logistic regression objective which is the same as minimizing the average cross-entropy across training cases between the true label and the model’s predicted label.<br />
<br />
=== Weight Initialization === <br />
<br />
It’s important to note that if a neuron always receives a negative value during training, it will not learn because its output is uniformly zero under the max-with-zero nonlinearity. Hence, the weights in their model were sampled from a zero-mean normal distribution with a high enough variance. High variance in weights will set a certain number of neurons with positive values for learning to happen, and in practice, it’s necessary to try out several candidates for variances until a working initialization is found. In their experiment, setting a positive constant, or 1, as biases of the neurons in the hidden layers was helpful in finding it.<br />
<br />
=== Training ===<br />
<br />
In this model, a batch size of 128 samples and momentum of 0.9, we train our model using stochastic gradient descent. The update rule for weight <math>w</math> is $$ v_{i+1} = 0.9v_i + \epsilon <\frac{dE}{dw_i}> i$$ $$w_{i+1} = w_i + v_{i+1} $$ where <math>i</math> is the iteration index, <math>v</math> is a momentum variable, <math>\epsilon</math> is the learning rate and <math>\frac{dE}{dw}</math> is the average over the <math>i</math>th batch of the derivative of the objective with respect to <math>w_i</math>. The whole training process on CIFAR-10 takes roughly 90 minutes and ImageNet takes 4 days with dropout and two days without.<br />
<br />
=== Learning ===<br />
To determine the learning rate for the network, it is a must to start with an equal learning rate for each layer which produces the largest reduction in the objective function with power of ten. Usually, it is in the order of <math>10^{-2}</math> or <math>10^{-3}</math>. In this case, they reduce the learning rate twice by a factor of ten before termination of training.<br />
<br />
= CIFAR-10 =<br />
<br />
=== CIFAR-10 Dataset ===<br />
<br />
Removing incorrect labels, The CIFAR-10 dataset is a subset of the Tiny Images dataset with 10 classes. It contains 5000 training images and 1000 testing images for each class. The dataset has 32 x 32 color images searched from the web and the images are labeled with the noun used to search the image.<br />
<br />
[[File:CIFAR-10.png|thumb|upright=2|center|alt=text|Figure 4: CIFAR-10 Sample Dataset]]<br />
<br />
=== Models for CIFAR-10 ===<br />
<br />
Two models, one with dropout and one without dropout, were built to test the performance of dropout on CIFAR-10. All models have CNN with three convolutional layers each with a pooling layer. The max-pooling method is performed by the pooling layer which follows the first convolutional layer, and the average-pooling method is performed by remaining 2 pooling layers. The first and second pooling layers with <math>N = 9, α = 0.001</math>, and <math>β = 0.75</math> are followed by response normalization layers. A ten-unit softmax layer, which is used to output a probability distribution over class labels, is connected with the upper-most pooling layer. Using filter size of 5×5, all convolutional layers have 64 filter banks.<br />
<br />
Additional changes were made with the model with dropout. The model with dropout enables us to use more parameters because dropout forces a strong regularization on the network. Thus, a fourth weight layer is added to take the input from the previous pooling layer. This fourth weight layer is locally connected, but not convolutional, and contains 16 banks of filters of size 3 × 3 with 50% dropout. Lastly, the softmax layer takes its input from this fourth weight layer.<br />
<br />
Thus, with a neural network with 3 convolutional hidden layers with 3 max-pooling layers, the classification error achieved 16.6% to beat 18.5% from the best published error rate without using transformed data. The model with one additional locally-connected layer and dropout at the last hidden layer produced the error rate of 15.6%.<br />
<br />
= ImageNet =<br />
<br />
===ImageNet Dataset===<br />
<br />
ImageNet is a dataset of millions of high-resolution labeled images in thousands of categories which were collected from the web and labelled by human labellers using MTerk tool (Amazon’s Mechanical Turk crowd-sourcing tool). Because this dataset has millions of labeled images in thousands of categories, it is very difficult to have perfect accuracy on this dataset even for humans because the ImageNet images may contain multiple objects and there are a large number of object classes. ImageNet and CIFAR-10 are very similar, but the scale of ImageNet is about 20 times bigger (1,300,000 vs 60,000). The size of ImageNet is about 1.3 million training images, 50,000 validation images, and 150,000 testing images. They used resized images of 256 x 256 pixels for their experiments.<br />
<br />
'''An ambiguous example to classify:'''<br />
<br />
[[File:imagenet1.png|200px|center]]<br />
<br />
When this paper was written, the best score on this dataset was the error rate of 45.7% by High-dimensional signature compression for large-scale image classification (J. Sanchez, F. Perronnin, CVPR11 (2011)). The authors of this paper could achieve a comparable performance of 48.6% error rate using a single neural network with five convolutional hidden layers with a max-pooling layer in between, followed by two globally connected layers and a final 1000-way softmax layer. When applying 50% dropout to the 6th layer, the error rate was brought down to 42.4%.<br />
<br />
'''ImageNet Dataset:'''<br />
<br />
[[File:imagenet2.png|400px|center]]<br />
<br />
===Models for ImageNet===<br />
<br />
The models for ImageNet with dropout (the model without dropout had a similar approach, but there was a serious issue with overfitting): <br />
<br />
They used a convolutional neural network trained by 224×224 patches randomly extracted from the 256 × 256 images. This could reduce the network’s capacity to overfit the training data and helped generalization as a form of data augmentation. The method of averaging the prediction of the net on ten 224 × 224 patches of the 256 × 256 input image was used for a testing (patched at the center, four corners, and their horizontal reflections).<br />
<br />
To maximize the performance on the validation set, this complicated network architecture was used and it was found that dropout was very effective. Also, it was demonstrated that using non-convolutional higher layers with the number of parameters worked well with dropout, but it had a negative impact to the performance without dropout.<br />
<br />
[[File:modelh2.png|700px|center]] <br />
<br />
[[File:layer2.png|600px|center]]<br />
<br />
It was demonstrated that making a large number of decisions was important for the architecture of the net design for the speech recognition (TIMIT) and object recognition datasets (CIFAR-10 and ImageNet). A separate validation set which evaluated the performance of a large number of different architectures was used to make those decisions, and then they chose the best performance architecture with dropout on the validation set so that they could apply it to the real test set.<br />
<br />
= Critiques =<br />
It is a very brilliant idea to dropout half of the neurons to reduce co-adaptations. It is mentioned that for fully connected layers, dropout in all hidden layers works better than dropout in only one hidden layer. There is another paper Dropout: A Simple Way to Prevent Neural Networks from<br />
Overfitting[https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf] gives a more detailed explanation.<br />
<br />
It will be interesting to see how this paper could be used to prevent overfitting of LSTMs.<br />
<br />
= Conclusion =<br />
<br />
The authors have shown a consistent improvement by the models trained with dropout in classifying objects in the following datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Research_Papers_Classification_System&diff=48414Research Papers Classification System2020-11-30T08:12:54Z<p>A29mukhe: /* Crawling of Abstract Data */</p>
<hr />
<div>= Presented by =<br />
Jill Wang, Junyi (Jay) Yang, Yu Min (Chris) Wu, Chun Kit (Calvin) Li<br />
<br />
= Introduction =<br />
This paper introduces a paper classification system that utilizes the Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and K-means clustering. The most important technology the system used to process big data is the Hadoop Distributed File Systems (HDFS). The system can handle quantitatively complex research paper classification problems efficiently and accurately.<br />
<br />
===General Framework===<br />
<br />
The paper classification system classifies research papers based on the abstracts given that the core of most papers is presented in the abstracts. <br />
<br />
<ol><li>Paper Crawling <br />
<p>Collects abstracts from research papers published during a given period</p></li><br />
<li>Preprocessing<br />
<p> <ol style="list-style-type:lower-alpha"><li>Removes stop words in the papers crawled, in which only nouns are extracted from the papers</li><br />
<li>generates a keyword dictionary, keeping only the top-N keywords with the highest frequencies</li> </ol><br />
</p></li> <br />
<li>Topic Modelling<br />
<p> Use the LDA to group the keywords into topics</p><br />
</li><br />
<li>Paper Length Calculation<br />
<p> Calculates the total number of occurrences of words to prevent an unbalanced TF values caused by the various length of abstracts using the map-reduce algorithm</p><br />
</li><br />
<li>Word Frequency Calculation<br />
<p> Calculates the Term Frequency (TF) values which represent the frequency of keywords in a research paper</p><br />
</li><br />
<li>Document Frequency Calculation<br />
<p> Calculates the Document Frequency (DF) values which represents the frequency of keywords in a collection of research papers. The higher the DF value, the lower the importance of a keyword.</p><br />
</li><br />
<li>TF-IDF calculation<br />
<p> Calculates the inverse of the DF which represents the importance of a keyword.</p><br />
</li><br />
<li>Paper Classification<br />
<p> Classify papers by topics using the K-means clustering algorithm.</p><br />
</li><br />
</ol><br />
<br />
===Technologies===<br />
<br />
The HDFS with a Hadoop cluster composed of one master node, one sub node, and four data nodes is what is used to process the massive paper data. Hadoop-2.6.5 version in Java is what is used to perform the TF-IDF calculation. Spark MLlib is what is used to perform the LDA. The Scikit-learn library is what is used to perform the K-means clustering.<br />
<br />
===HDFS===<br />
<br />
Hadoop Distributed File Systems was used to process big data in this system. What Hadoop does is to break a big collection of data into different partitions and pass each partition to one individual processor. Each processor will only have information about the partition of data it received.<br />
<br />
'''In this summary, we are going to focus on introducing the main algorithms of what this system uses, namely LDA, TF-IDF, and K-Means.'''<br />
<br />
=Data Preprocessing=<br />
===Crawling of Abstract Data===<br />
<br />
Under the assumption that audiences tend to first read the abstract of a paper to gain an overall understanding of the material, it is reasonable to assume the abstract section includes “core words” that can be used to effectively classify a paper's subject.<br />
<br />
An abstract is crawled to have its stop words removed. Stop words are words that are usually ignored by search engines, such as “the”, “a”, and etc. Afterwards, nouns are extracted, as a more condensed representation for efficient analysis.<br />
<br />
This is managed on HDFS. The TF-IDF value of each paper is calculated through map-reduce.<br />
<br />
===Managing Paper Data===<br />
<br />
To construct an effective keyword dictionary using abstract data and keywords data in all of the crawled papers, the authors categorized keywords with similar meanings using a single representative keyword. The approach is called stemming, which is common in cleaning data. 1394 keyword categories are extracted, which is still too much to compute. Hence, only the top 30 keyword categories are used.<br />
<br />
<div align="center">[[File:table_1_kswf.JPG|700px]]</div><br />
<br />
=Topic Modeling Using LDA=<br />
<br />
Latent Dirichlet allocation (LDA) is a generative probabilistic model that views documents as random mixtures over latent topics. Each topic is a distribution over words, and the goal is to extract these topics from documents.<br />
<br />
LDA estimates the topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> using Dirichlet priors for the distributions with a fixed number of topics. For each document, obtain a feature vector:<br />
<br />
\[F = \left( P\left(z_1 | d\right), P\left(z_2 | d\right), \cdots, P\left(z_k | d\right) \right)\]<br />
<br />
In the paper, authors extract topics from preprocessed paper to generate three kinds of topic sets, each with 10, 20, and 30 topics respectively. The following is a table of the 10 topic sets of highest frequency keywords.<br />
<br />
<div align="center">[[File:table_2_tswtebls.JPG|700px]]</div><br />
<br />
<br />
===LDA Intuition===<br />
<br />
LDA uses the Dirichlet priors of the Dirichlet distribution. The following picture illustrates 2-simplex Dirichlet distributions with different alpha values, one for each corner of the triangles. <br />
<br />
<div align="center">[[File:dirichlet_dist.png|700px]]</div><br />
<br />
Simplex is a generalization of the notion of a triangle. In Dirichlet distribution, each parameter will be represented by a corner in simplex, so adding additional parameters implies increasing the dimensions of simplex. As illustrated, when alphas are smaller than 1 the distribution is dense at the corners. When the alphas are greater than 1 the distribution is dense at the centers.<br />
<br />
The following illustration shows an example LDA with 3 topics, 4 words and 7 documents.<br />
<br />
<div align="center">[[File:LDA_example.png|800px]]</div><br />
<br />
In the left diagram, there are three topics, hence it is a 2-simplex. In the right diagram there are four words, hence it is a 3-simplex. LDA essentially adjusts parameters in Dirichlet distributions and multinomial distributions (represented by the points), such that, in the left diagram, all the yellow points representing documents and, in the right diagram, all the points representing topics, are as close to a corner as possible. In other words, LDA finds topics for documents and also finds words for topics. At the end topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> are produced.<br />
<br />
=Term Frequency Inverse Document Frequency (TF-IDF) Calculation=<br />
<br />
TF-IDF is widely used to evaluate the importance of a set of words in the fields of information retrieval and text mining. It is a combination of term frequency (TF) and inverse document frequency (IDF). The idea behind this combination is<br />
It evaluates the importance of a word within a document, and<br />
It evaluates the importance of the word among the collection of all documents<br />
<br />
The TF-IDF formula has the following form:<br />
<br />
\[TF-IDF_{i,j} = TF_{i,j} \times IDF_{i}\]<br />
<br />
where i stands for the <math>i^{th}</math> word and j stands for the <math>j^{th}</math> document.<br />
<br />
===Term Frequency (TF)===<br />
<br />
TF evaluates the percentage of a given word in a document. Thus, TF value indicates the importance of a word. The TF has a positive relation with the importance.<br />
<br />
In this paper, we only calculate TF for words in the keyword dictionary obtained. For a given keyword i, <math>TF_{i,j}</math> is the number of times word i appears in document j divided by the total number of words in document j.<br />
<br />
The formula for TF has the following form:<br />
<br />
\[TF_{i,j} = \frac{n_{i,j} }{\sum_k n_{k,j} }\]<br />
<br />
where i stands for the <math>i^{th}</math> word, j stands for the <math>j^{th}</math> document, and <math>n_{i,j}</math> stands for the number of times words i appear in document j.<br />
<br />
Note that the denominator is the total number of words remaining in document j after crawling.<br />
<br />
===Document Frequency (DF)===<br />
<br />
DF evaluates the percentage of documents that contain a given word over the entire collection of documents. Thus, the higher DF value is, the less important the word is. Since DF and the importance of the word have an inverse relation, we use IDF instead of DF.<br />
<br />
<math>DF_{i}</math> is the number of documents in the collection with word i divided by the total number of documents in the collection. The formula for DF has the following form:<br />
<br />
\[DF_{i} = \frac{|d_k \in D: n_{i,k} > 0|}{|D|}\]<br />
<br />
where <math>n_{i,k}</math> is the number of times word i appears in document k, |D| is the total number of documents in the collection.<br />
<br />
===Inverse Document Frequency (IDF)===<br />
<br />
In this paper, IDF is calculated in a log scale. Since we will receive a large number of documents, i.e, we will have a large |D|<br />
<br />
The formula for IDF has the following form:<br />
<br />
\[IDF_{i} = log\left(\frac{|D|}{|\{d_k \in D: n_{i,k} > 0\}|}\right)\]<br />
<br />
As mentioned before, we will use HDFS. The actual formula applied is:<br />
<br />
\[IDF_{i} = log\left(\frac{|D|+1}{|\{d_k \in D: n_{i,k} > 0\}|+1}\right)\]<br />
<br />
=Paper Classification Using K-means Clustering=<br />
<br />
The K-means clustering is an unsupervised classification algorithm that groups similar data into the same class. It is an efficient and simple method that can work with different types of data attributes and is able to handle noise and outliers.<br />
<br><br />
<br />
Given a set of <math>d</math> by <math>n</math> dataset <math>\mathbf{X} = \left[ \mathbf{x}_1 \cdots \mathbf{x}_n \right]</math>, the algorithm will assign each <math>\mathbf{x}_j</math> into <math>k</math> different clusters based on the characteristics of <math>\mathbf{x}_j</math> itself.<br />
<br><br />
<br />
Moreover, when assigning data into a cluster, the algorithm will also try to minimise the distances between the data and the centre of the cluster which the data belongs to. That is, k-means clustering will minimise the sum of square error:<br />
<br />
\begin{align*}<br />
min \sum_{i=1}^{k} \sum_{j \in C_i} ||x_j - \mu_i||^2<br />
\end{align*}<br />
<br />
where<br />
<ul><br />
<li><math>k</math>: the number of clusters</li><br />
<li><math>C_i</math>: the <math>i^th</math> cluster</li><br />
<li><math>x_j</math>: the <math>j^th</math> data in the <math>C_i</math></li><br />
<li><math>mu_i</math>: the centroid of <math>C_i</math></li><br />
<li><math>||x_j - \mu_i||^2</math>: the Euclidean distance between <math>x_j</math> and <math>\mu_i</math></li><br />
</ul><br />
<br><br />
<br />
Since the goal for this paper is to classify research papers and group papers with similar topics based on keywords, the paper uses the K-means clustering algorithm. The algorithm first computes the cluster centroid for each group of papers with a specific topic. Then, it will assign a paper into a cluster based on the Euclidean distance between the cluster centroid and the paper’s TF-IDF value.<br />
<br><br />
<br />
However, different values of <math>k</math> (the number of clusters) will return different clustering results. Therefore, it is important to define the number of clusters before clustering. For example, in this paper, the authors choose to use the Elbow scheme to determine the value of <math>k</math>. Also, to measure the performance of clustering, the authors decide to use the Silhouette scheme. The results of clustering are validated if the Silhouette scheme returns a value greater than <math>0.5</math>.<br />
<br />
=System Testing Results=<br />
<br />
In this paper, the dataset has 3264 research papers from the Future Generation Computer System (FGCS) journal between 1984 and 2017. For constructing keyword dictionaries for each paper, the authors have introduced three methods as shown below:<br />
<br />
<div align="center">[[File:table_3_tmtckd.JPG|700px]]</div><br />
<br />
<br />
Then, the authors use the Elbow scheme to define the number of clusters for each method with different numbers of keywords before running the K-means clustering algorithm. The results are shown below:<br />
<br />
<div align="center">[[File:table_4_nocobes.JPG|700px]]</div><br />
<br />
According to table 4, there is a positive correlation between the number of keywords and the number of clusters. In addition, method 3 combines the advantages for both method 1 and method 2; thus, method 3 requires the least clusters in total. On the other hand, the wrong keywords might be presented in papers; hence, it might not be possible to group papers with similar subjects correctly by using method 1 and so method 1 needs the most number of clusters in total.<br />
<br />
<br />
Next, the Silhouette scheme had been used for measuring the performance for clustering. The average of the Silhouette values for each method with different numbers of keywords are shown below:<br />
<br />
<div align="center">[[File:table_5_asv.JPG|700px]]</div><br />
<br />
Since the clustering is validated if the Silhouette’s value is greater than 0.5, for methods with 10 and 30 keywords, the K-means clustering algorithm produces good results.<br />
<br />
<br />
To evaluate the accuracy of the classification system in this paper, the authors use the F-Score. The authors execute 5 times of experiment and use 500 randomly selected research papers for each trial. The following histogram shows the average value of F-Score for the three methods and different numbers of keywords:<br />
<br />
<div align="center">[[File:fig_16_fsvotm.JPG|700px]]</div><br />
<br />
Note that “TFIDF” means method 1, “LDA” means method 2, and “TFIDF-LDA” means method 3. The number 10, 20, and 30 after each method is the number of keywords the method has used.<br />
According to the histogram above, method 3 has the highest F-Score values than the other two methods with different numbers of keywords. Therefore, the classification system is most accurate when using method 3 as it combines the advantages for both method 1 and method 2.<br />
<br />
=Conclusion=<br />
<br />
This paper introduces a classification system that classifies papers into different topics by using TF-IDF and LDA scheme with K-means clustering algorithm. This system allows users to search the papers they want quickly and with the most productivity.<br />
<br />
Furthermore, this classification system might be also used in different types of texts (e.g. documents, tweets, etc.) instead of only classifying research papers.<br />
<br />
=Critique=<br />
<br />
In this paper, DF values are calculated within each partition. This results that for each partition, DF value for a given word will vary and may have an inconsistent result for different partition methods. As mentioned above, there might be a divide by zero problem since some partitions do not have documents containing a given word, but this can be solved by introducing a dummy document as the authors did. Another method that might be better at solving inconsistent results and the divide by zero problems is to have all partitions to communicate with their DF value. Then pass the merged DF value to all partitions to do the final IDF and TF-IDF value. Having all partitions to communicate with the DF value will guarantee a consistent DF value across all partitions and helps avoid a divide by zero problem as words in the keyword dictionary must appear in some documents in the whole collection.<br />
<br />
=References=<br />
<br />
Blei DM, el. (2003). Latent Dirichlet allocation. J Mach Learn Res 3:993–1022<br />
<br />
Gil, JM, Kim, SW. (2019). Research paper classification systems based on TF-IDF and LDA schemes. ''Human-centric Computing and Information Sciences'', 9, 30 (2019). https://doi.org/10.1186/s13673-019-0192-7<br />
<br />
Liu, S. (2019, January 11). Dirichlet distribution Motivating LDA. Retrieved November 2020, from https://towardsdatascience.com/dirichlet-distribution-a82ab942a879<br />
<br />
Serrano, L. (Director). (2020, March 18). Latent Dirichlet Allocation (Part 1 of 2) [Video file]. Retrieved 2020, from https://www.youtube.com/watch?v=T05t-SqKArY</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=IPBoost&diff=48411IPBoost2020-11-30T08:05:38Z<p>A29mukhe: /* Sparsification */</p>
<hr />
<div>== Presented by == <br />
Casey De Vera, Solaiman Jawad<br />
<br />
== Introduction == <br />
Boosting is an important and a fairly standard technique in a classification that combines several base learners (learners which have a “low accuracy”), into one boosted learner (learner with a “high accuracy”). Pioneered by the AdaBoost approach of Freund & Schapire, in recent decades there has been extensive work on boosting procedures and analyses of their limitations.<br />
<br />
In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:<br />
<br />
Let <math>T</math> represents the number of iteration and <math>t</math> represents the iterator. <br />
<br />
for <math> t= 1, \cdots, T </math> do the following:<br />
<br />
::1. Train a learner <math> \mu_t</math> from a given class of base learners on the data distribution <math> \mathcal D_t</math><br />
<br />
::2. Evaluate the performance of <math> \mu_t</math> by computing its loss.<br />
<br />
::3. Push weight of the data distribution <math> \mathcal D_t</math> towards the misclassified examples leading to <math> \mathcal D_{t+1}</math>, usually a relatively good model will have a higher weight on those data that misclassified.<br />
<br />
Finally, the learners are combined with some form of voting (e.g., soft or hard voting, averaging, thresholding).<br />
<br />
<br />
[[File:boosting.gif|200px|thumb|right]] A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function by means of coordinate gradient descent. Boosting schemes of this type are often referred to as '''convex potential boosters'''. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. However, they can be defeated easily by a small number of incorrect labels which can be quite difficult to fix. The reason being is that we try to solve misclassified examples by moving the weights around, which results in bad performance on unseen data (Shown by zoomed example to the right). In fact, in theory, provided the class of base learners is rich enough, a perfect strong learner can be constructed that has accuracy 1, however, clearly, such a learner might not necessarily generalize well. Boosted learners can generate some quite complicated decision boundaries, much more complicated than that of the base learners. Here is an example from Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook. Here data is generated online according to some process with optimal decision boundary represented by the dotted line and XGBoost was used to learn a classifier:<br />
<br />
<br />
Recently non-convex optimization approaches for solving machine learning problems have gained significant attention. In this paper, the non-convex boosting in classification using integer programming has been explored, and the real-world practicability of the approach while circumventing shortcomings of convex boosting approaches is also shown. The paper reports results that are comparable to or better than current state-of-the-art approaches.<br />
<br />
== Motivation ==<br />
<br />
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. Noisy labels are an issue because when a model is trained with excessive amounts of noisy labels, the performance and accuracy deteriorate greatly. However, if we revisit the general boosting template from above, then we might suspect that we run into trouble as soon as a certain fraction of training examples is misclassified: in this case, these examples cannot be correctly classified and the procedure shifts more and more weight towards these bad examples. This eventually leads to a strong learner, that perfectly predicts the (flawed) training data; however, that does not generalize well anymore. This intuition has been formalized by [LS] who construct a “hard” training data distribution, where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in performance of these boosted learners; see tables below. The more technical reason for this problem is actually the convexity of the loss function that is minimized by the boosting procedure. Clearly, one can use all types of “tricks” such as early stopping but at the end of the day, this is not solving the fundamental problem.<br />
<br />
== IPBoost: Boosting via Integer Programming ==<br />
<br />
<br />
===Integer Program Formulation===<br />
Let <math>(x_1,y_1),\cdots, (x_N,y_N) </math> be the training set with points <math>x_i \in \mathbb{R}^d</math> and two-class labels <math>y_i \in \{\pm 1\}</math> <br />
* class of base learners: <math> \Omega :=\{h_1, \cdots, h_L: \mathbb{R}^d \rightarrow \{\pm 1\}\} </math> and <math>\rho \ge 0</math> be given. <br />
* error function <math> \eta </math><br />
Our boosting model is captured by the integer programming problem. We can call this our primal problem: <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j=1}^L \eta_{ij}\lambda_k+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i=1,\cdots, N \\ <br />
&\sum_{j=1}^L \lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
For the error function <math>\eta</math>, three options were considered:<br />
<br />
(i) <math> \pm 1 </math> classification from learners<br />
$$ \eta_{ij} := 2\mathbb{I}[h_j(x_i) = y_i] - 1 = y_i \cdot h_j(x_i) $$<br />
(ii) class probabilities of learners<br />
$$ \eta_{ij} := 2\mathbb{P}[h_j(x_i) = y_i] - 1$$<br />
(iii) SAMME.R error function for learners<br />
$$ \frac{1}{2}y_i\log\left(\frac{\mathbb{P}[h_j(x_i) = 1]}{\mathbb{P}[h_j(x_i) = -1]}\right)$$<br />
<br />
===Solution of the IP using Column Generation===<br />
<br />
The goal of column generation is to provide an efficient way to solve the linear programming relaxation of the primal by allowing the <math>z_i </math> variables to assume fractional values. Moreover, columns, i.e., the base learners, <math> \mathcal L \subseteq [L]. </math> are left out because there are too many to handle efficiently and most of them will have their associated weight equal to zero in the optimal solution anyway. To generate columns, a <i>branch and bound</i> framework is used. Columns are generated within a<br />
branch-and-bound framework leading effectively to a branch-and-bound-and-price algorithm being used; this is significantly more involved compared to column generation in linear programming. To check the optimality of an LP solution, a subproblem, called the pricing problem, is solved to try to identify columns with a profitable reduced cost. If such columns are found, the LP is re-optimized. Branching occurs when no profitable columns are found, but the LP solution does not satisfy the integrality conditions. Branch and price apply column generation at every node of the branch and bound tree.<br />
<br />
The restricted master primal problem is <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j\in \mathcal L} \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i \in [N]\\ <br />
&\sum_{j\in \mathcal L}\lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
<br />
Its restricted dual problem is:<br />
<br />
$$ \begin{align*}\max \rho &\sum^{N}_{i=1}w_i + v - \sum^{N}_{i=1}u_i<br />
\\ s.t. &\sum_{i=1}^N \eta_{ij}w_k+ v \le 0 \ \ \ \forall j \in L \\ <br />
&(1+\rho)w_i - u_i \le 1 \ \ \ \forall i \in [N] \\ &w \ge 0, u \ge 0, v\ free\end{align*}$$<br />
<br />
Furthermore, there is a pricing problem used to determine, for every supposed optimal solution of the dual, whether the solution is actually optimal, or whether further constraints need to be added into the primal solution. With this pricing problem, we check whether the solution to the restricted dual is feasible. This pricing problem can be expressed as follows:<br />
<br />
$$ \sum_{i=1}^N \eta_{ij}w_k^* + v^* > 0 $$<br />
<br />
The optimal misclassification values are determined by a branch-and-price process that branches on the variables <math> z_i </math> and solves the intermediate LPs using column generation.<br />
<br />
===Algorithm===<br />
<div style="margin-left: 3em;"><br />
<math> D = \{(x_i, y_i) | i ∈ I\} ⊆ R^d × \{±1\} </math>, class of base learners <math>Ω </math>, margin <math> \rho </math> <br><br />
'''Output:''' Boosted learner <math> \sum_{j∈L^∗}h_jλ_j^* </math> with base learners <math> h_j </math> and weights <math> λ_j^* </math> <br><br />
<br />
<ol><br />
<br />
<li margin-left:30px> <math> T ← \{([0, 1]^N, \emptyset)\} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // set of local bounds and learners for open subproblems </li><br />
<li> <math> U ← \infty, L^∗ ← \emptyset </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Upper bound on optimal objective </li><br />
<li> '''while''' <math>\ T \neq \emptyset </math> '''do''' </li><br />
<li> &emsp; Choose and remove <math>(B,L) </math> from <math>T </math> </li><br />
<li> &emsp; '''repeat''' </li><br />
<li> &emsp; &emsp; Solve the primal IP using the local bounds on <math> z </math> in <math>B</math> with optimal dual solution <math> (w^∗, v^∗, u^∗) </math> </li><br />
<li> &emsp; &emsp; Find learner <math> h_j ∈ Ω </math> satisfying the pricing problem. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Solve pricing problem. </li><br />
<li> &emsp; '''until''' <math> h_j </math> is not found </li> <br />
<li> &emsp; Let <math> (\widetilde{λ} , \widetilde{z}) </math> be the final solution of the primal IP with base learners <math> \widetilde{L} = \{j | \widetilde{λ}_j > 0\} </math> </li><br />
<li> &emsp; '''if''' <math> \widetilde{z} ∈ \mathbb{Z}^N </math> and <math> \sum^{N}_{i=1}\widetilde{z}_i < U </math> '''then''' </li><br />
<li> &emsp; &emsp; <math> U ← \sum^{N}_{i=1}\widetilde{z}_i, L^∗ ← \widetilde{L}, λ^∗ ← \widetilde{\lambda} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Update best solution. </li><br />
<li> &emsp; '''else''' </li><br />
<li> &emsp; &emsp; Choose <math> i ∈ [N] </math> with <math> \widetilde{z}_i \notin Z </math> </li><br />
<li> &emsp; &emsp; Set <math> B_0 ← B ∩ \{z_i ≤ 0\}, B_1 ← B ∩ \{z_i ≥ 1\} </math> </li><br />
<li> &emsp; &emsp; Add <math> (B_0,\widetilde{L}), (B_1,\widetilde{L}) </math> to <math> T </math>. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Create new branching nodes. </li><br />
<li> &emsp; '''end''' if </li><br />
<li> '''end''' while </li><br />
<li> ''Optionally sparsify final solution <math>L^*</math>'' </li><br />
<br />
</ol><br />
</div><br />
<br />
===Sparsification===<br />
<br />
When building the boosting model, one of the challenges is to balance model accuracy and generalization. A sparse model is generally easier to interpret. In order to control complexity, overfitting, and generalization of the model, typically some sparsity is applied, which is easier to interpret for some applications. The goal of sparsification is to minimize the following: \begin{equation}\sum_{i=1}^{N}z_i+\sum_{i=1}^{L}\alpha_iy_i\end{equation} There are two techniques commonly used. One is early stopping, another is regularization by adding a complexity term for the learners in the objective function. Previous approaches in the context of LP-based boosting have promoted sparsity by means of cutting planes. Sparsification can be handled in the approach introduced in this paper by solving a delayed integer program using additional cutting planes.<br />
<br />
== Results and Performance ==<br />
<br />
''All tests were run on identical Linux clusters with Intel Xeon Quad Core CPUs, with 3.50GHz, 10 MB cache, and 32 GB of main memory.''<br />
<br />
<br />
The following results reflect IPBoost's performance in hard instances. Note that by hard instances, we mean a binary classification problem with predefined labels. These examples are tailored to using the ±1 classification from learners. On every hard instance sample, IPBoost significantly outperforms both LPBoost and AdaBoost (although implementations depending on the libraries used have often caused results to differ slightly). For the considered instances the best value for the margin ρ was 0.05 for LPBoost and IPBoost; AdaBoost has no margin parameter. The accuracy reported is test accuracy recorded across various different walkthroughs of the algorithm, while <math>L </math> denotes the aggregate number of learners required to find the optimal learner, N is the number of points and <math> \gamma </math> refers to the noise level.<br />
<br />
[[File:ipboostres.png|center]]<br />
<br />
<br />
<br />
For the next table, the classification instances from LIBSVM data sets available at [https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/]. We report accuracies on the test set and train set, respectively. In each case, we report the averages of the accuracies over 10 runs with a different random seed and their standard deviations. We can see IPboost again outperforming LPBoost and AdaBoost significantly. Solving Integer Programming problems is no doubt more computationally expensive than traditional boosting methods like AdaBoost. The average run time of IPBoost (for ρ = 0.05) being 1367.78 seconds, as opposed to LPBoost's 164.35 seconds and AdaBoost's 3.59 seconds reflects exactly that. However, on the flip side, we gain much better stability in our results, as well as higher scores across the board for both training and test sets.<br />
<br />
[[file:svmlibres.png|center]]<br />
<br />
<br />
== Conclusion ==<br />
<br />
IP-boosting avoids the bad performance on well-known hard classes and improves upon LP-boosting and AdaBoost on the LIBSVM instances where even a few percent improvements is valuable. The major drawback is that the running time with the current implementation is much longer. Nevertheless, the algorithm can be improved in the future by solving the intermediate LPs only approximately and deriving tailored heuristics that generate decent primal solutions to save on time.<br />
<br />
The approach is suited very well to an offline setting in which training may take time and where even a small improvement is beneficial or when convex boosters have egregious behaviour. It can also be served as a tool to investigate the general performance of methods like this.<br />
<br />
The IPBoost algorithm added extra complexity into basic boosting models to gain slight accuracy gain while greatly increased the time spent. It is 381 times slower compared to an AdaBoost model on a small dataset which makes the actual usage of this model doubtable. If we are supplied with a larger dataset with millions of records this model would take too long to complete. The base classifier choice was XGBoost which is too complicated for a base classifier, maybe try some weaker learners such as tree stumps to compare the result with other models. In addition, this model might not be accurate compared to the model-ensembling technique where each model utilizes a different algorithm.<br />
<br />
I think IP Boost can be helpful in cases where a small improvement in accuracy is worth the additional computational effort involved. For example, there are applications in Finance where predicting the movement of stock prices slightly more accurately can help traders make millions more. In such cases, investing in expensive computing systems to implement such boosting techniques can perhaps be justified.<br />
<br />
Although the IPBoost can generate better results compared to the base boosting models such as AdaBoost. The time complexity of IPBoost is much higher than other models. Hence, it will be problem to apply IPBoost to the real-world training data since those training data will be large and it's hard to deal with. Thus, we are left with the question that this IPBoost is only a little bit more accurate than the base boosting models but a lot more time complex. So is the model really worth it?<br />
<br />
== Critique ==<br />
<br />
In practise AdaBoost & LP Boost are quite robust to over fitting. In practise if you use simple weak learns stumps (1-level decision trees) then the algorithms are much less prone to over fitting. However the noise level in the data, particularly for AdaBoost is prone to overfitting on noisy datasets. To avoid this use regularised models (AdaBoostReg, LPBoost). Further, when in higher dimensional spaces, AdaBoost can suffer, since its just a linear combination of classifiers. You can use k-fold cross validation to set the stopping parameter to improve this.<br />
<br />
== References ==<br />
<br />
* Pfetsch, M. E., & Pokutta, S. (2020). IPBoost--Non-Convex Boosting via Integer Programming. arXiv preprint arXiv:2002.04679.<br />
<br />
* Freund, Y., & Schapire, R. E. (1995, March). A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory (pp. 23-37). Springer, Berlin, Heidelberg. pdf</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=IPBoost&diff=48410IPBoost2020-11-30T08:05:00Z<p>A29mukhe: /* Sparsification */</p>
<hr />
<div>== Presented by == <br />
Casey De Vera, Solaiman Jawad<br />
<br />
== Introduction == <br />
Boosting is an important and a fairly standard technique in a classification that combines several base learners (learners which have a “low accuracy”), into one boosted learner (learner with a “high accuracy”). Pioneered by the AdaBoost approach of Freund & Schapire, in recent decades there has been extensive work on boosting procedures and analyses of their limitations.<br />
<br />
In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:<br />
<br />
Let <math>T</math> represents the number of iteration and <math>t</math> represents the iterator. <br />
<br />
for <math> t= 1, \cdots, T </math> do the following:<br />
<br />
::1. Train a learner <math> \mu_t</math> from a given class of base learners on the data distribution <math> \mathcal D_t</math><br />
<br />
::2. Evaluate the performance of <math> \mu_t</math> by computing its loss.<br />
<br />
::3. Push weight of the data distribution <math> \mathcal D_t</math> towards the misclassified examples leading to <math> \mathcal D_{t+1}</math>, usually a relatively good model will have a higher weight on those data that misclassified.<br />
<br />
Finally, the learners are combined with some form of voting (e.g., soft or hard voting, averaging, thresholding).<br />
<br />
<br />
[[File:boosting.gif|200px|thumb|right]] A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function by means of coordinate gradient descent. Boosting schemes of this type are often referred to as '''convex potential boosters'''. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. However, they can be defeated easily by a small number of incorrect labels which can be quite difficult to fix. The reason being is that we try to solve misclassified examples by moving the weights around, which results in bad performance on unseen data (Shown by zoomed example to the right). In fact, in theory, provided the class of base learners is rich enough, a perfect strong learner can be constructed that has accuracy 1, however, clearly, such a learner might not necessarily generalize well. Boosted learners can generate some quite complicated decision boundaries, much more complicated than that of the base learners. Here is an example from Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook. Here data is generated online according to some process with optimal decision boundary represented by the dotted line and XGBoost was used to learn a classifier:<br />
<br />
<br />
Recently non-convex optimization approaches for solving machine learning problems have gained significant attention. In this paper, the non-convex boosting in classification using integer programming has been explored, and the real-world practicability of the approach while circumventing shortcomings of convex boosting approaches is also shown. The paper reports results that are comparable to or better than current state-of-the-art approaches.<br />
<br />
== Motivation ==<br />
<br />
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. Noisy labels are an issue because when a model is trained with excessive amounts of noisy labels, the performance and accuracy deteriorate greatly. However, if we revisit the general boosting template from above, then we might suspect that we run into trouble as soon as a certain fraction of training examples is misclassified: in this case, these examples cannot be correctly classified and the procedure shifts more and more weight towards these bad examples. This eventually leads to a strong learner, that perfectly predicts the (flawed) training data; however, that does not generalize well anymore. This intuition has been formalized by [LS] who construct a “hard” training data distribution, where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in performance of these boosted learners; see tables below. The more technical reason for this problem is actually the convexity of the loss function that is minimized by the boosting procedure. Clearly, one can use all types of “tricks” such as early stopping but at the end of the day, this is not solving the fundamental problem.<br />
<br />
== IPBoost: Boosting via Integer Programming ==<br />
<br />
<br />
===Integer Program Formulation===<br />
Let <math>(x_1,y_1),\cdots, (x_N,y_N) </math> be the training set with points <math>x_i \in \mathbb{R}^d</math> and two-class labels <math>y_i \in \{\pm 1\}</math> <br />
* class of base learners: <math> \Omega :=\{h_1, \cdots, h_L: \mathbb{R}^d \rightarrow \{\pm 1\}\} </math> and <math>\rho \ge 0</math> be given. <br />
* error function <math> \eta </math><br />
Our boosting model is captured by the integer programming problem. We can call this our primal problem: <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j=1}^L \eta_{ij}\lambda_k+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i=1,\cdots, N \\ <br />
&\sum_{j=1}^L \lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
For the error function <math>\eta</math>, three options were considered:<br />
<br />
(i) <math> \pm 1 </math> classification from learners<br />
$$ \eta_{ij} := 2\mathbb{I}[h_j(x_i) = y_i] - 1 = y_i \cdot h_j(x_i) $$<br />
(ii) class probabilities of learners<br />
$$ \eta_{ij} := 2\mathbb{P}[h_j(x_i) = y_i] - 1$$<br />
(iii) SAMME.R error function for learners<br />
$$ \frac{1}{2}y_i\log\left(\frac{\mathbb{P}[h_j(x_i) = 1]}{\mathbb{P}[h_j(x_i) = -1]}\right)$$<br />
<br />
===Solution of the IP using Column Generation===<br />
<br />
The goal of column generation is to provide an efficient way to solve the linear programming relaxation of the primal by allowing the <math>z_i </math> variables to assume fractional values. Moreover, columns, i.e., the base learners, <math> \mathcal L \subseteq [L]. </math> are left out because there are too many to handle efficiently and most of them will have their associated weight equal to zero in the optimal solution anyway. To generate columns, a <i>branch and bound</i> framework is used. Columns are generated within a<br />
branch-and-bound framework leading effectively to a branch-and-bound-and-price algorithm being used; this is significantly more involved compared to column generation in linear programming. To check the optimality of an LP solution, a subproblem, called the pricing problem, is solved to try to identify columns with a profitable reduced cost. If such columns are found, the LP is re-optimized. Branching occurs when no profitable columns are found, but the LP solution does not satisfy the integrality conditions. Branch and price apply column generation at every node of the branch and bound tree.<br />
<br />
The restricted master primal problem is <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j\in \mathcal L} \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i \in [N]\\ <br />
&\sum_{j\in \mathcal L}\lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
<br />
Its restricted dual problem is:<br />
<br />
$$ \begin{align*}\max \rho &\sum^{N}_{i=1}w_i + v - \sum^{N}_{i=1}u_i<br />
\\ s.t. &\sum_{i=1}^N \eta_{ij}w_k+ v \le 0 \ \ \ \forall j \in L \\ <br />
&(1+\rho)w_i - u_i \le 1 \ \ \ \forall i \in [N] \\ &w \ge 0, u \ge 0, v\ free\end{align*}$$<br />
<br />
Furthermore, there is a pricing problem used to determine, for every supposed optimal solution of the dual, whether the solution is actually optimal, or whether further constraints need to be added into the primal solution. With this pricing problem, we check whether the solution to the restricted dual is feasible. This pricing problem can be expressed as follows:<br />
<br />
$$ \sum_{i=1}^N \eta_{ij}w_k^* + v^* > 0 $$<br />
<br />
The optimal misclassification values are determined by a branch-and-price process that branches on the variables <math> z_i </math> and solves the intermediate LPs using column generation.<br />
<br />
===Algorithm===<br />
<div style="margin-left: 3em;"><br />
<math> D = \{(x_i, y_i) | i ∈ I\} ⊆ R^d × \{±1\} </math>, class of base learners <math>Ω </math>, margin <math> \rho </math> <br><br />
'''Output:''' Boosted learner <math> \sum_{j∈L^∗}h_jλ_j^* </math> with base learners <math> h_j </math> and weights <math> λ_j^* </math> <br><br />
<br />
<ol><br />
<br />
<li margin-left:30px> <math> T ← \{([0, 1]^N, \emptyset)\} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // set of local bounds and learners for open subproblems </li><br />
<li> <math> U ← \infty, L^∗ ← \emptyset </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Upper bound on optimal objective </li><br />
<li> '''while''' <math>\ T \neq \emptyset </math> '''do''' </li><br />
<li> &emsp; Choose and remove <math>(B,L) </math> from <math>T </math> </li><br />
<li> &emsp; '''repeat''' </li><br />
<li> &emsp; &emsp; Solve the primal IP using the local bounds on <math> z </math> in <math>B</math> with optimal dual solution <math> (w^∗, v^∗, u^∗) </math> </li><br />
<li> &emsp; &emsp; Find learner <math> h_j ∈ Ω </math> satisfying the pricing problem. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Solve pricing problem. </li><br />
<li> &emsp; '''until''' <math> h_j </math> is not found </li> <br />
<li> &emsp; Let <math> (\widetilde{λ} , \widetilde{z}) </math> be the final solution of the primal IP with base learners <math> \widetilde{L} = \{j | \widetilde{λ}_j > 0\} </math> </li><br />
<li> &emsp; '''if''' <math> \widetilde{z} ∈ \mathbb{Z}^N </math> and <math> \sum^{N}_{i=1}\widetilde{z}_i < U </math> '''then''' </li><br />
<li> &emsp; &emsp; <math> U ← \sum^{N}_{i=1}\widetilde{z}_i, L^∗ ← \widetilde{L}, λ^∗ ← \widetilde{\lambda} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Update best solution. </li><br />
<li> &emsp; '''else''' </li><br />
<li> &emsp; &emsp; Choose <math> i ∈ [N] </math> with <math> \widetilde{z}_i \notin Z </math> </li><br />
<li> &emsp; &emsp; Set <math> B_0 ← B ∩ \{z_i ≤ 0\}, B_1 ← B ∩ \{z_i ≥ 1\} </math> </li><br />
<li> &emsp; &emsp; Add <math> (B_0,\widetilde{L}), (B_1,\widetilde{L}) </math> to <math> T </math>. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Create new branching nodes. </li><br />
<li> &emsp; '''end''' if </li><br />
<li> '''end''' while </li><br />
<li> ''Optionally sparsify final solution <math>L^*</math>'' </li><br />
<br />
</ol><br />
</div><br />
<br />
===Sparsification===<br />
<br />
When building the boosting model, one of the challenges is to balance model accuracy and generalization. A sparse model is generally easier to interpret. In order to control complexity, overfitting, and generalization of the model, typically some sparsity is applied, which is easier to interpret for some applications. The goal of sparsification is to minimize \begin{equation}\sum_{i=1}^{N}z_i+\sum_{i=1}^{L}\alpha_iy_i\end{equation} There are two techniques commonly used. One is early stopping, another is regularization by adding a complexity term for the learners in the objective function. Previous approaches in the context of LP-based boosting have promoted sparsity by means of cutting planes. Sparsification can be handled in the approach introduced in this paper by solving a delayed integer program using additional cutting planes.<br />
<br />
== Results and Performance ==<br />
<br />
''All tests were run on identical Linux clusters with Intel Xeon Quad Core CPUs, with 3.50GHz, 10 MB cache, and 32 GB of main memory.''<br />
<br />
<br />
The following results reflect IPBoost's performance in hard instances. Note that by hard instances, we mean a binary classification problem with predefined labels. These examples are tailored to using the ±1 classification from learners. On every hard instance sample, IPBoost significantly outperforms both LPBoost and AdaBoost (although implementations depending on the libraries used have often caused results to differ slightly). For the considered instances the best value for the margin ρ was 0.05 for LPBoost and IPBoost; AdaBoost has no margin parameter. The accuracy reported is test accuracy recorded across various different walkthroughs of the algorithm, while <math>L </math> denotes the aggregate number of learners required to find the optimal learner, N is the number of points and <math> \gamma </math> refers to the noise level.<br />
<br />
[[File:ipboostres.png|center]]<br />
<br />
<br />
<br />
For the next table, the classification instances from LIBSVM data sets available at [https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/]. We report accuracies on the test set and train set, respectively. In each case, we report the averages of the accuracies over 10 runs with a different random seed and their standard deviations. We can see IPboost again outperforming LPBoost and AdaBoost significantly. Solving Integer Programming problems is no doubt more computationally expensive than traditional boosting methods like AdaBoost. The average run time of IPBoost (for ρ = 0.05) being 1367.78 seconds, as opposed to LPBoost's 164.35 seconds and AdaBoost's 3.59 seconds reflects exactly that. However, on the flip side, we gain much better stability in our results, as well as higher scores across the board for both training and test sets.<br />
<br />
[[file:svmlibres.png|center]]<br />
<br />
<br />
== Conclusion ==<br />
<br />
IP-boosting avoids the bad performance on well-known hard classes and improves upon LP-boosting and AdaBoost on the LIBSVM instances where even a few percent improvements is valuable. The major drawback is that the running time with the current implementation is much longer. Nevertheless, the algorithm can be improved in the future by solving the intermediate LPs only approximately and deriving tailored heuristics that generate decent primal solutions to save on time.<br />
<br />
The approach is suited very well to an offline setting in which training may take time and where even a small improvement is beneficial or when convex boosters have egregious behaviour. It can also be served as a tool to investigate the general performance of methods like this.<br />
<br />
The IPBoost algorithm added extra complexity into basic boosting models to gain slight accuracy gain while greatly increased the time spent. It is 381 times slower compared to an AdaBoost model on a small dataset which makes the actual usage of this model doubtable. If we are supplied with a larger dataset with millions of records this model would take too long to complete. The base classifier choice was XGBoost which is too complicated for a base classifier, maybe try some weaker learners such as tree stumps to compare the result with other models. In addition, this model might not be accurate compared to the model-ensembling technique where each model utilizes a different algorithm.<br />
<br />
I think IP Boost can be helpful in cases where a small improvement in accuracy is worth the additional computational effort involved. For example, there are applications in Finance where predicting the movement of stock prices slightly more accurately can help traders make millions more. In such cases, investing in expensive computing systems to implement such boosting techniques can perhaps be justified.<br />
<br />
Although the IPBoost can generate better results compared to the base boosting models such as AdaBoost. The time complexity of IPBoost is much higher than other models. Hence, it will be problem to apply IPBoost to the real-world training data since those training data will be large and it's hard to deal with. Thus, we are left with the question that this IPBoost is only a little bit more accurate than the base boosting models but a lot more time complex. So is the model really worth it?<br />
<br />
== Critique ==<br />
<br />
In practise AdaBoost & LP Boost are quite robust to over fitting. In practise if you use simple weak learns stumps (1-level decision trees) then the algorithms are much less prone to over fitting. However the noise level in the data, particularly for AdaBoost is prone to overfitting on noisy datasets. To avoid this use regularised models (AdaBoostReg, LPBoost). Further, when in higher dimensional spaces, AdaBoost can suffer, since its just a linear combination of classifiers. You can use k-fold cross validation to set the stopping parameter to improve this.<br />
<br />
== References ==<br />
<br />
* Pfetsch, M. E., & Pokutta, S. (2020). IPBoost--Non-Convex Boosting via Integer Programming. arXiv preprint arXiv:2002.04679.<br />
<br />
* Freund, Y., & Schapire, R. E. (1995, March). A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory (pp. 23-37). Springer, Berlin, Heidelberg. pdf</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=IPBoost&diff=48409IPBoost2020-11-30T08:03:02Z<p>A29mukhe: /* Sparsification */</p>
<hr />
<div>== Presented by == <br />
Casey De Vera, Solaiman Jawad<br />
<br />
== Introduction == <br />
Boosting is an important and a fairly standard technique in a classification that combines several base learners (learners which have a “low accuracy”), into one boosted learner (learner with a “high accuracy”). Pioneered by the AdaBoost approach of Freund & Schapire, in recent decades there has been extensive work on boosting procedures and analyses of their limitations.<br />
<br />
In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:<br />
<br />
Let <math>T</math> represents the number of iteration and <math>t</math> represents the iterator. <br />
<br />
for <math> t= 1, \cdots, T </math> do the following:<br />
<br />
::1. Train a learner <math> \mu_t</math> from a given class of base learners on the data distribution <math> \mathcal D_t</math><br />
<br />
::2. Evaluate the performance of <math> \mu_t</math> by computing its loss.<br />
<br />
::3. Push weight of the data distribution <math> \mathcal D_t</math> towards the misclassified examples leading to <math> \mathcal D_{t+1}</math>, usually a relatively good model will have a higher weight on those data that misclassified.<br />
<br />
Finally, the learners are combined with some form of voting (e.g., soft or hard voting, averaging, thresholding).<br />
<br />
<br />
[[File:boosting.gif|200px|thumb|right]] A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function by means of coordinate gradient descent. Boosting schemes of this type are often referred to as '''convex potential boosters'''. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. However, they can be defeated easily by a small number of incorrect labels which can be quite difficult to fix. The reason being is that we try to solve misclassified examples by moving the weights around, which results in bad performance on unseen data (Shown by zoomed example to the right). In fact, in theory, provided the class of base learners is rich enough, a perfect strong learner can be constructed that has accuracy 1, however, clearly, such a learner might not necessarily generalize well. Boosted learners can generate some quite complicated decision boundaries, much more complicated than that of the base learners. Here is an example from Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook. Here data is generated online according to some process with optimal decision boundary represented by the dotted line and XGBoost was used to learn a classifier:<br />
<br />
<br />
Recently non-convex optimization approaches for solving machine learning problems have gained significant attention. In this paper, the non-convex boosting in classification using integer programming has been explored, and the real-world practicability of the approach while circumventing shortcomings of convex boosting approaches is also shown. The paper reports results that are comparable to or better than current state-of-the-art approaches.<br />
<br />
== Motivation ==<br />
<br />
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. Noisy labels are an issue because when a model is trained with excessive amounts of noisy labels, the performance and accuracy deteriorate greatly. However, if we revisit the general boosting template from above, then we might suspect that we run into trouble as soon as a certain fraction of training examples is misclassified: in this case, these examples cannot be correctly classified and the procedure shifts more and more weight towards these bad examples. This eventually leads to a strong learner, that perfectly predicts the (flawed) training data; however, that does not generalize well anymore. This intuition has been formalized by [LS] who construct a “hard” training data distribution, where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in performance of these boosted learners; see tables below. The more technical reason for this problem is actually the convexity of the loss function that is minimized by the boosting procedure. Clearly, one can use all types of “tricks” such as early stopping but at the end of the day, this is not solving the fundamental problem.<br />
<br />
== IPBoost: Boosting via Integer Programming ==<br />
<br />
<br />
===Integer Program Formulation===<br />
Let <math>(x_1,y_1),\cdots, (x_N,y_N) </math> be the training set with points <math>x_i \in \mathbb{R}^d</math> and two-class labels <math>y_i \in \{\pm 1\}</math> <br />
* class of base learners: <math> \Omega :=\{h_1, \cdots, h_L: \mathbb{R}^d \rightarrow \{\pm 1\}\} </math> and <math>\rho \ge 0</math> be given. <br />
* error function <math> \eta </math><br />
Our boosting model is captured by the integer programming problem. We can call this our primal problem: <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j=1}^L \eta_{ij}\lambda_k+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i=1,\cdots, N \\ <br />
&\sum_{j=1}^L \lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
For the error function <math>\eta</math>, three options were considered:<br />
<br />
(i) <math> \pm 1 </math> classification from learners<br />
$$ \eta_{ij} := 2\mathbb{I}[h_j(x_i) = y_i] - 1 = y_i \cdot h_j(x_i) $$<br />
(ii) class probabilities of learners<br />
$$ \eta_{ij} := 2\mathbb{P}[h_j(x_i) = y_i] - 1$$<br />
(iii) SAMME.R error function for learners<br />
$$ \frac{1}{2}y_i\log\left(\frac{\mathbb{P}[h_j(x_i) = 1]}{\mathbb{P}[h_j(x_i) = -1]}\right)$$<br />
<br />
===Solution of the IP using Column Generation===<br />
<br />
The goal of column generation is to provide an efficient way to solve the linear programming relaxation of the primal by allowing the <math>z_i </math> variables to assume fractional values. Moreover, columns, i.e., the base learners, <math> \mathcal L \subseteq [L]. </math> are left out because there are too many to handle efficiently and most of them will have their associated weight equal to zero in the optimal solution anyway. To generate columns, a <i>branch and bound</i> framework is used. Columns are generated within a<br />
branch-and-bound framework leading effectively to a branch-and-bound-and-price algorithm being used; this is significantly more involved compared to column generation in linear programming. To check the optimality of an LP solution, a subproblem, called the pricing problem, is solved to try to identify columns with a profitable reduced cost. If such columns are found, the LP is re-optimized. Branching occurs when no profitable columns are found, but the LP solution does not satisfy the integrality conditions. Branch and price apply column generation at every node of the branch and bound tree.<br />
<br />
The restricted master primal problem is <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j\in \mathcal L} \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i \in [N]\\ <br />
&\sum_{j\in \mathcal L}\lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
<br />
Its restricted dual problem is:<br />
<br />
$$ \begin{align*}\max \rho &\sum^{N}_{i=1}w_i + v - \sum^{N}_{i=1}u_i<br />
\\ s.t. &\sum_{i=1}^N \eta_{ij}w_k+ v \le 0 \ \ \ \forall j \in L \\ <br />
&(1+\rho)w_i - u_i \le 1 \ \ \ \forall i \in [N] \\ &w \ge 0, u \ge 0, v\ free\end{align*}$$<br />
<br />
Furthermore, there is a pricing problem used to determine, for every supposed optimal solution of the dual, whether the solution is actually optimal, or whether further constraints need to be added into the primal solution. With this pricing problem, we check whether the solution to the restricted dual is feasible. This pricing problem can be expressed as follows:<br />
<br />
$$ \sum_{i=1}^N \eta_{ij}w_k^* + v^* > 0 $$<br />
<br />
The optimal misclassification values are determined by a branch-and-price process that branches on the variables <math> z_i </math> and solves the intermediate LPs using column generation.<br />
<br />
===Algorithm===<br />
<div style="margin-left: 3em;"><br />
<math> D = \{(x_i, y_i) | i ∈ I\} ⊆ R^d × \{±1\} </math>, class of base learners <math>Ω </math>, margin <math> \rho </math> <br><br />
'''Output:''' Boosted learner <math> \sum_{j∈L^∗}h_jλ_j^* </math> with base learners <math> h_j </math> and weights <math> λ_j^* </math> <br><br />
<br />
<ol><br />
<br />
<li margin-left:30px> <math> T ← \{([0, 1]^N, \emptyset)\} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // set of local bounds and learners for open subproblems </li><br />
<li> <math> U ← \infty, L^∗ ← \emptyset </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Upper bound on optimal objective </li><br />
<li> '''while''' <math>\ T \neq \emptyset </math> '''do''' </li><br />
<li> &emsp; Choose and remove <math>(B,L) </math> from <math>T </math> </li><br />
<li> &emsp; '''repeat''' </li><br />
<li> &emsp; &emsp; Solve the primal IP using the local bounds on <math> z </math> in <math>B</math> with optimal dual solution <math> (w^∗, v^∗, u^∗) </math> </li><br />
<li> &emsp; &emsp; Find learner <math> h_j ∈ Ω </math> satisfying the pricing problem. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Solve pricing problem. </li><br />
<li> &emsp; '''until''' <math> h_j </math> is not found </li> <br />
<li> &emsp; Let <math> (\widetilde{λ} , \widetilde{z}) </math> be the final solution of the primal IP with base learners <math> \widetilde{L} = \{j | \widetilde{λ}_j > 0\} </math> </li><br />
<li> &emsp; '''if''' <math> \widetilde{z} ∈ \mathbb{Z}^N </math> and <math> \sum^{N}_{i=1}\widetilde{z}_i < U </math> '''then''' </li><br />
<li> &emsp; &emsp; <math> U ← \sum^{N}_{i=1}\widetilde{z}_i, L^∗ ← \widetilde{L}, λ^∗ ← \widetilde{\lambda} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Update best solution. </li><br />
<li> &emsp; '''else''' </li><br />
<li> &emsp; &emsp; Choose <math> i ∈ [N] </math> with <math> \widetilde{z}_i \notin Z </math> </li><br />
<li> &emsp; &emsp; Set <math> B_0 ← B ∩ \{z_i ≤ 0\}, B_1 ← B ∩ \{z_i ≥ 1\} </math> </li><br />
<li> &emsp; &emsp; Add <math> (B_0,\widetilde{L}), (B_1,\widetilde{L}) </math> to <math> T </math>. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Create new branching nodes. </li><br />
<li> &emsp; '''end''' if </li><br />
<li> '''end''' while </li><br />
<li> ''Optionally sparsify final solution <math>L^*</math>'' </li><br />
<br />
</ol><br />
</div><br />
<br />
===Sparsification===<br />
<br />
When building the boosting model, one of the challenges is to balance model accuracy and generalization. A sparse model is generally easier to interpret. In order to control complexity, overfitting, and generalization of the model, typically some sparsity is applied, which is easier to interpret for some applications. There are two techniques commonly used. One is early stopping, another is regularization by adding a complexity term for the learners in the objective function. Previous approaches in the context of LP-based boosting have promoted sparsity by means of cutting planes. Sparsification can be handled in the approach introduced in this paper by solving a delayed integer program using additional cutting planes.<br />
<br />
== Results and Performance ==<br />
<br />
''All tests were run on identical Linux clusters with Intel Xeon Quad Core CPUs, with 3.50GHz, 10 MB cache, and 32 GB of main memory.''<br />
<br />
<br />
The following results reflect IPBoost's performance in hard instances. Note that by hard instances, we mean a binary classification problem with predefined labels. These examples are tailored to using the ±1 classification from learners. On every hard instance sample, IPBoost significantly outperforms both LPBoost and AdaBoost (although implementations depending on the libraries used have often caused results to differ slightly). For the considered instances the best value for the margin ρ was 0.05 for LPBoost and IPBoost; AdaBoost has no margin parameter. The accuracy reported is test accuracy recorded across various different walkthroughs of the algorithm, while <math>L </math> denotes the aggregate number of learners required to find the optimal learner, N is the number of points and <math> \gamma </math> refers to the noise level.<br />
<br />
[[File:ipboostres.png|center]]<br />
<br />
<br />
<br />
For the next table, the classification instances from LIBSVM data sets available at [https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/]. We report accuracies on the test set and train set, respectively. In each case, we report the averages of the accuracies over 10 runs with a different random seed and their standard deviations. We can see IPboost again outperforming LPBoost and AdaBoost significantly. Solving Integer Programming problems is no doubt more computationally expensive than traditional boosting methods like AdaBoost. The average run time of IPBoost (for ρ = 0.05) being 1367.78 seconds, as opposed to LPBoost's 164.35 seconds and AdaBoost's 3.59 seconds reflects exactly that. However, on the flip side, we gain much better stability in our results, as well as higher scores across the board for both training and test sets.<br />
<br />
[[file:svmlibres.png|center]]<br />
<br />
<br />
== Conclusion ==<br />
<br />
IP-boosting avoids the bad performance on well-known hard classes and improves upon LP-boosting and AdaBoost on the LIBSVM instances where even a few percent improvements is valuable. The major drawback is that the running time with the current implementation is much longer. Nevertheless, the algorithm can be improved in the future by solving the intermediate LPs only approximately and deriving tailored heuristics that generate decent primal solutions to save on time.<br />
<br />
The approach is suited very well to an offline setting in which training may take time and where even a small improvement is beneficial or when convex boosters have egregious behaviour. It can also be served as a tool to investigate the general performance of methods like this.<br />
<br />
The IPBoost algorithm added extra complexity into basic boosting models to gain slight accuracy gain while greatly increased the time spent. It is 381 times slower compared to an AdaBoost model on a small dataset which makes the actual usage of this model doubtable. If we are supplied with a larger dataset with millions of records this model would take too long to complete. The base classifier choice was XGBoost which is too complicated for a base classifier, maybe try some weaker learners such as tree stumps to compare the result with other models. In addition, this model might not be accurate compared to the model-ensembling technique where each model utilizes a different algorithm.<br />
<br />
I think IP Boost can be helpful in cases where a small improvement in accuracy is worth the additional computational effort involved. For example, there are applications in Finance where predicting the movement of stock prices slightly more accurately can help traders make millions more. In such cases, investing in expensive computing systems to implement such boosting techniques can perhaps be justified.<br />
<br />
Although the IPBoost can generate better results compared to the base boosting models such as AdaBoost. The time complexity of IPBoost is much higher than other models. Hence, it will be problem to apply IPBoost to the real-world training data since those training data will be large and it's hard to deal with. Thus, we are left with the question that this IPBoost is only a little bit more accurate than the base boosting models but a lot more time complex. So is the model really worth it?<br />
<br />
== Critique ==<br />
<br />
In practise AdaBoost & LP Boost are quite robust to over fitting. In practise if you use simple weak learns stumps (1-level decision trees) then the algorithms are much less prone to over fitting. However the noise level in the data, particularly for AdaBoost is prone to overfitting on noisy datasets. To avoid this use regularised models (AdaBoostReg, LPBoost). Further, when in higher dimensional spaces, AdaBoost can suffer, since its just a linear combination of classifiers. You can use k-fold cross validation to set the stopping parameter to improve this.<br />
<br />
== References ==<br />
<br />
* Pfetsch, M. E., & Pokutta, S. (2020). IPBoost--Non-Convex Boosting via Integer Programming. arXiv preprint arXiv:2002.04679.<br />
<br />
* Freund, Y., & Schapire, R. E. (1995, March). A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory (pp. 23-37). Springer, Berlin, Heidelberg. pdf</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=IPBoost&diff=48408IPBoost2020-11-30T08:00:50Z<p>A29mukhe: /* Sparsification */</p>
<hr />
<div>== Presented by == <br />
Casey De Vera, Solaiman Jawad<br />
<br />
== Introduction == <br />
Boosting is an important and a fairly standard technique in a classification that combines several base learners (learners which have a “low accuracy”), into one boosted learner (learner with a “high accuracy”). Pioneered by the AdaBoost approach of Freund & Schapire, in recent decades there has been extensive work on boosting procedures and analyses of their limitations.<br />
<br />
In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:<br />
<br />
Let <math>T</math> represents the number of iteration and <math>t</math> represents the iterator. <br />
<br />
for <math> t= 1, \cdots, T </math> do the following:<br />
<br />
::1. Train a learner <math> \mu_t</math> from a given class of base learners on the data distribution <math> \mathcal D_t</math><br />
<br />
::2. Evaluate the performance of <math> \mu_t</math> by computing its loss.<br />
<br />
::3. Push weight of the data distribution <math> \mathcal D_t</math> towards the misclassified examples leading to <math> \mathcal D_{t+1}</math>, usually a relatively good model will have a higher weight on those data that misclassified.<br />
<br />
Finally, the learners are combined with some form of voting (e.g., soft or hard voting, averaging, thresholding).<br />
<br />
<br />
[[File:boosting.gif|200px|thumb|right]] A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function by means of coordinate gradient descent. Boosting schemes of this type are often referred to as '''convex potential boosters'''. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. However, they can be defeated easily by a small number of incorrect labels which can be quite difficult to fix. The reason being is that we try to solve misclassified examples by moving the weights around, which results in bad performance on unseen data (Shown by zoomed example to the right). In fact, in theory, provided the class of base learners is rich enough, a perfect strong learner can be constructed that has accuracy 1, however, clearly, such a learner might not necessarily generalize well. Boosted learners can generate some quite complicated decision boundaries, much more complicated than that of the base learners. Here is an example from Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook. Here data is generated online according to some process with optimal decision boundary represented by the dotted line and XGBoost was used to learn a classifier:<br />
<br />
<br />
Recently non-convex optimization approaches for solving machine learning problems have gained significant attention. In this paper, the non-convex boosting in classification using integer programming has been explored, and the real-world practicability of the approach while circumventing shortcomings of convex boosting approaches is also shown. The paper reports results that are comparable to or better than current state-of-the-art approaches.<br />
<br />
== Motivation ==<br />
<br />
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. Noisy labels are an issue because when a model is trained with excessive amounts of noisy labels, the performance and accuracy deteriorate greatly. However, if we revisit the general boosting template from above, then we might suspect that we run into trouble as soon as a certain fraction of training examples is misclassified: in this case, these examples cannot be correctly classified and the procedure shifts more and more weight towards these bad examples. This eventually leads to a strong learner, that perfectly predicts the (flawed) training data; however, that does not generalize well anymore. This intuition has been formalized by [LS] who construct a “hard” training data distribution, where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in performance of these boosted learners; see tables below. The more technical reason for this problem is actually the convexity of the loss function that is minimized by the boosting procedure. Clearly, one can use all types of “tricks” such as early stopping but at the end of the day, this is not solving the fundamental problem.<br />
<br />
== IPBoost: Boosting via Integer Programming ==<br />
<br />
<br />
===Integer Program Formulation===<br />
Let <math>(x_1,y_1),\cdots, (x_N,y_N) </math> be the training set with points <math>x_i \in \mathbb{R}^d</math> and two-class labels <math>y_i \in \{\pm 1\}</math> <br />
* class of base learners: <math> \Omega :=\{h_1, \cdots, h_L: \mathbb{R}^d \rightarrow \{\pm 1\}\} </math> and <math>\rho \ge 0</math> be given. <br />
* error function <math> \eta </math><br />
Our boosting model is captured by the integer programming problem. We can call this our primal problem: <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j=1}^L \eta_{ij}\lambda_k+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i=1,\cdots, N \\ <br />
&\sum_{j=1}^L \lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
For the error function <math>\eta</math>, three options were considered:<br />
<br />
(i) <math> \pm 1 </math> classification from learners<br />
$$ \eta_{ij} := 2\mathbb{I}[h_j(x_i) = y_i] - 1 = y_i \cdot h_j(x_i) $$<br />
(ii) class probabilities of learners<br />
$$ \eta_{ij} := 2\mathbb{P}[h_j(x_i) = y_i] - 1$$<br />
(iii) SAMME.R error function for learners<br />
$$ \frac{1}{2}y_i\log\left(\frac{\mathbb{P}[h_j(x_i) = 1]}{\mathbb{P}[h_j(x_i) = -1]}\right)$$<br />
<br />
===Solution of the IP using Column Generation===<br />
<br />
The goal of column generation is to provide an efficient way to solve the linear programming relaxation of the primal by allowing the <math>z_i </math> variables to assume fractional values. Moreover, columns, i.e., the base learners, <math> \mathcal L \subseteq [L]. </math> are left out because there are too many to handle efficiently and most of them will have their associated weight equal to zero in the optimal solution anyway. To generate columns, a <i>branch and bound</i> framework is used. Columns are generated within a<br />
branch-and-bound framework leading effectively to a branch-and-bound-and-price algorithm being used; this is significantly more involved compared to column generation in linear programming. To check the optimality of an LP solution, a subproblem, called the pricing problem, is solved to try to identify columns with a profitable reduced cost. If such columns are found, the LP is re-optimized. Branching occurs when no profitable columns are found, but the LP solution does not satisfy the integrality conditions. Branch and price apply column generation at every node of the branch and bound tree.<br />
<br />
The restricted master primal problem is <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j\in \mathcal L} \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i \in [N]\\ <br />
&\sum_{j\in \mathcal L}\lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
<br />
Its restricted dual problem is:<br />
<br />
$$ \begin{align*}\max \rho &\sum^{N}_{i=1}w_i + v - \sum^{N}_{i=1}u_i<br />
\\ s.t. &\sum_{i=1}^N \eta_{ij}w_k+ v \le 0 \ \ \ \forall j \in L \\ <br />
&(1+\rho)w_i - u_i \le 1 \ \ \ \forall i \in [N] \\ &w \ge 0, u \ge 0, v\ free\end{align*}$$<br />
<br />
Furthermore, there is a pricing problem used to determine, for every supposed optimal solution of the dual, whether the solution is actually optimal, or whether further constraints need to be added into the primal solution. With this pricing problem, we check whether the solution to the restricted dual is feasible. This pricing problem can be expressed as follows:<br />
<br />
$$ \sum_{i=1}^N \eta_{ij}w_k^* + v^* > 0 $$<br />
<br />
The optimal misclassification values are determined by a branch-and-price process that branches on the variables <math> z_i </math> and solves the intermediate LPs using column generation.<br />
<br />
===Algorithm===<br />
<div style="margin-left: 3em;"><br />
<math> D = \{(x_i, y_i) | i ∈ I\} ⊆ R^d × \{±1\} </math>, class of base learners <math>Ω </math>, margin <math> \rho </math> <br><br />
'''Output:''' Boosted learner <math> \sum_{j∈L^∗}h_jλ_j^* </math> with base learners <math> h_j </math> and weights <math> λ_j^* </math> <br><br />
<br />
<ol><br />
<br />
<li margin-left:30px> <math> T ← \{([0, 1]^N, \emptyset)\} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // set of local bounds and learners for open subproblems </li><br />
<li> <math> U ← \infty, L^∗ ← \emptyset </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Upper bound on optimal objective </li><br />
<li> '''while''' <math>\ T \neq \emptyset </math> '''do''' </li><br />
<li> &emsp; Choose and remove <math>(B,L) </math> from <math>T </math> </li><br />
<li> &emsp; '''repeat''' </li><br />
<li> &emsp; &emsp; Solve the primal IP using the local bounds on <math> z </math> in <math>B</math> with optimal dual solution <math> (w^∗, v^∗, u^∗) </math> </li><br />
<li> &emsp; &emsp; Find learner <math> h_j ∈ Ω </math> satisfying the pricing problem. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Solve pricing problem. </li><br />
<li> &emsp; '''until''' <math> h_j </math> is not found </li> <br />
<li> &emsp; Let <math> (\widetilde{λ} , \widetilde{z}) </math> be the final solution of the primal IP with base learners <math> \widetilde{L} = \{j | \widetilde{λ}_j > 0\} </math> </li><br />
<li> &emsp; '''if''' <math> \widetilde{z} ∈ \mathbb{Z}^N </math> and <math> \sum^{N}_{i=1}\widetilde{z}_i < U </math> '''then''' </li><br />
<li> &emsp; &emsp; <math> U ← \sum^{N}_{i=1}\widetilde{z}_i, L^∗ ← \widetilde{L}, λ^∗ ← \widetilde{\lambda} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Update best solution. </li><br />
<li> &emsp; '''else''' </li><br />
<li> &emsp; &emsp; Choose <math> i ∈ [N] </math> with <math> \widetilde{z}_i \notin Z </math> </li><br />
<li> &emsp; &emsp; Set <math> B_0 ← B ∩ \{z_i ≤ 0\}, B_1 ← B ∩ \{z_i ≥ 1\} </math> </li><br />
<li> &emsp; &emsp; Add <math> (B_0,\widetilde{L}), (B_1,\widetilde{L}) </math> to <math> T </math>. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Create new branching nodes. </li><br />
<li> &emsp; '''end''' if </li><br />
<li> '''end''' while </li><br />
<li> ''Optionally sparsify final solution <math>L^*</math>'' </li><br />
<br />
</ol><br />
</div><br />
<br />
===Sparsification===<br />
<br />
When building the boosting model, one of the challenges is to balance model accuracy and generalization. A sparse model is generally easier to interpret. In order to control complexity, overfitting, and generalization of the model, typically some sparsity is applied, which is easier to interpret for some applications. There are two techniques commonly used. One is early stopping, another is regularization by adding a complexity term for the learners in the objective function. Previous approaches in the context of LP-based boosting have promoted sparsity by means of cutting planes. Sparsification can be handled in the approach introduced in this paper by solving a delayed integer program using additional cutting planes. Sparsification is done by considering the following optimization problem:<br />
<br />
[[File:ipboostsparsification.png]]<br />
<br />
== Results and Performance ==<br />
<br />
''All tests were run on identical Linux clusters with Intel Xeon Quad Core CPUs, with 3.50GHz, 10 MB cache, and 32 GB of main memory.''<br />
<br />
<br />
The following results reflect IPBoost's performance in hard instances. Note that by hard instances, we mean a binary classification problem with predefined labels. These examples are tailored to using the ±1 classification from learners. On every hard instance sample, IPBoost significantly outperforms both LPBoost and AdaBoost (although implementations depending on the libraries used have often caused results to differ slightly). For the considered instances the best value for the margin ρ was 0.05 for LPBoost and IPBoost; AdaBoost has no margin parameter. The accuracy reported is test accuracy recorded across various different walkthroughs of the algorithm, while <math>L </math> denotes the aggregate number of learners required to find the optimal learner, N is the number of points and <math> \gamma </math> refers to the noise level.<br />
<br />
[[File:ipboostres.png|center]]<br />
<br />
<br />
<br />
For the next table, the classification instances from LIBSVM data sets available at [https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/]. We report accuracies on the test set and train set, respectively. In each case, we report the averages of the accuracies over 10 runs with a different random seed and their standard deviations. We can see IPboost again outperforming LPBoost and AdaBoost significantly. Solving Integer Programming problems is no doubt more computationally expensive than traditional boosting methods like AdaBoost. The average run time of IPBoost (for ρ = 0.05) being 1367.78 seconds, as opposed to LPBoost's 164.35 seconds and AdaBoost's 3.59 seconds reflects exactly that. However, on the flip side, we gain much better stability in our results, as well as higher scores across the board for both training and test sets.<br />
<br />
[[file:svmlibres.png|center]]<br />
<br />
<br />
== Conclusion ==<br />
<br />
IP-boosting avoids the bad performance on well-known hard classes and improves upon LP-boosting and AdaBoost on the LIBSVM instances where even a few percent improvements is valuable. The major drawback is that the running time with the current implementation is much longer. Nevertheless, the algorithm can be improved in the future by solving the intermediate LPs only approximately and deriving tailored heuristics that generate decent primal solutions to save on time.<br />
<br />
The approach is suited very well to an offline setting in which training may take time and where even a small improvement is beneficial or when convex boosters have egregious behaviour. It can also be served as a tool to investigate the general performance of methods like this.<br />
<br />
The IPBoost algorithm added extra complexity into basic boosting models to gain slight accuracy gain while greatly increased the time spent. It is 381 times slower compared to an AdaBoost model on a small dataset which makes the actual usage of this model doubtable. If we are supplied with a larger dataset with millions of records this model would take too long to complete. The base classifier choice was XGBoost which is too complicated for a base classifier, maybe try some weaker learners such as tree stumps to compare the result with other models. In addition, this model might not be accurate compared to the model-ensembling technique where each model utilizes a different algorithm.<br />
<br />
I think IP Boost can be helpful in cases where a small improvement in accuracy is worth the additional computational effort involved. For example, there are applications in Finance where predicting the movement of stock prices slightly more accurately can help traders make millions more. In such cases, investing in expensive computing systems to implement such boosting techniques can perhaps be justified.<br />
<br />
Although the IPBoost can generate better results compared to the base boosting models such as AdaBoost. The time complexity of IPBoost is much higher than other models. Hence, it will be problem to apply IPBoost to the real-world training data since those training data will be large and it's hard to deal with. Thus, we are left with the question that this IPBoost is only a little bit more accurate than the base boosting models but a lot more time complex. So is the model really worth it?<br />
<br />
== Critique ==<br />
<br />
In practise AdaBoost & LP Boost are quite robust to over fitting. In practise if you use simple weak learns stumps (1-level decision trees) then the algorithms are much less prone to over fitting. However the noise level in the data, particularly for AdaBoost is prone to overfitting on noisy datasets. To avoid this use regularised models (AdaBoostReg, LPBoost). Further, when in higher dimensional spaces, AdaBoost can suffer, since its just a linear combination of classifiers. You can use k-fold cross validation to set the stopping parameter to improve this.<br />
<br />
== References ==<br />
<br />
* Pfetsch, M. E., & Pokutta, S. (2020). IPBoost--Non-Convex Boosting via Integer Programming. arXiv preprint arXiv:2002.04679.<br />
<br />
* Freund, Y., & Schapire, R. E. (1995, March). A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory (pp. 23-37). Springer, Berlin, Heidelberg. pdf</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:ipboostsparsification.png&diff=48407File:ipboostsparsification.png2020-11-30T07:57:07Z<p>A29mukhe: </p>
<hr />
<div></div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat&diff=48406User:Cvmustat2020-11-30T07:23:45Z<p>A29mukhe: /* Comments */</p>
<hr />
<div><br />
== Combine Convolution with Recurrent Networks for Text Classification == <br />
'''Team Members''': Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea<br />
<br />
'''Date''': Week of Nov 23 <br />
<br />
== Introduction ==<br />
<br />
<br />
Text classification is the task of assigning a set of predefined categories to natural language texts. It involves learning an embedding layer which allows context-dependent classification. It is a fundamental task in Natural Language Processing (NLP) with various applications such as sentiment analysis, and topic classification. A classic example involving text classification is given a set of News articles, is it possible to classify the genre or subject of each article? Text classification is useful as text data is a rich source of information, but extracting insights from it directly can be difficult and time-consuming as most text data is unstructured.[1] NLP text classification can help automatically structure and analyze text quickly and cost-effectively, allowing for individuals to extract important features from the text easier than before. <br />
<br />
Text classification work mainly focuses on three topics: feature engineering, feature selection, and the use of different types of machine learning algorithms.<br />
:1. Feature engineering, the most widely used feature is the bag of words feature. Some more complex functions are also designed, such as part-of-speech tags, noun phrases, and tree kernels.<br />
:2. Feature selection aims to remove noisy features and improve classification performance. The most common feature selection method is to delete stop words.<br />
:3. Machine learning algorithms usually use classifiers, such as Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM).<br />
<br />
In practice, pre-trained word embeddings and deep neural networks are used together for NLP text classification. Word embeddings are used to map the raw text data to an implicit space where the semantic relationships of the words are preserved; words with similar meaning have a similar representation. One can then feed these embeddings into deep neural networks to learn different features of the text. Convolutional neural networks can be used to determine the semantic composition of the text(the meaning), as it treats texts as a 2D matrix by concatenating the embedding of words together. It uses a 1D convolution operator to perform the feature mapping and then conducts a 1D pooling operation over the time domain for obtaining a fixed-length output feature vector, and it can capture both local and position invariant features of the text.[2] Alternatively, Recurrent Neural Networks can be used to determine the contextual meaning of each word in the text (how each word relates to one another) by treating the text as sequential data and then analyzing each word separately. [3] Previous approaches to attempt to combine these two neural networks to incorporate the advantages of both models involve streamlining the two networks, which might decrease their performance. Besides, most methods incorporating a bi-directional Recurrent Neural Network usually concatenate the forward and backward hidden states at each time step, which results in a vector that does not have the interaction information between the forward and backward hidden states.[4] The hidden state in one direction contains only the contextual meaning in that particular direction, however a word's contextual representation, intuitively, is more accurate when collected and viewed from both directions. This paper argues that the failure to observe the meaning of a word in both directions causes the loss of the true meaning of the word, especially for polysemic words (words with more than one meaning) that are context-sensitive.<br />
<br />
== Paper Key Contributions ==<br />
<br />
This paper suggests an enhanced method of text classification by proposing a new way of combining Convolutional and Recurrent Neural Networks (CRNN) involving the addition of a neural tensor layer. The proposed method maintains each network's respective strengths that are normally lost in previous combination methods. The new suggested architecture is called CRNN, which utilizes both a CNN and RNN that run in parallel on the same input sentence. CNN uses weight matrix learning and produces a 2D matrix that shows the importance of each word based on local and position-invariant features. The bidirectional RNN produces a matrix that learns each word's contextual representation; the words' importance about to with concerning the rest of the sentence. A neural tensor layer is introduced on top of the RNN to obtain the fusion of bi-directional contextual information surrounding a particular word. This method combines these two matrix representations and classifies the text, providing the important information of each word for prediction, which helps to explain the results. The model also uses dropout and L2 regularization to prevent overfitting.<br />
<br />
== CRNN Results vs Benchmarks ==<br />
<br />
In order to benchmark the performance of the CRNN model, as well as compare it to other previous efforts, multiple datasets and classification problems were used. All of these datasets are publicly available and can be easily downloaded by any user for testing.<br />
<br />
- '''Movie Reviews:''' a sentiment analysis dataset, with two classes (positive and negative).<br />
<br />
- '''Yelp:''' a sentiment analysis dataset, with five classes. For this test, a subset of 120,000 reviews was randomly chosen from each class for a total of 600,000 reviews.<br />
<br />
- '''AG's News:''' a news categorization dataset, using only the 4 largest classes from the dataset. There are 30000 training samples and 1900 test samples.<br />
<br />
- '''20 Newsgroups:''' a news categorization dataset, again using only 4 large classes (comp, politics, rec, and religion) from the dataset.<br />
<br />
- '''Sogou News:''' a Chinese news categorization dataset, using 10 major classes as a multi-class classification and include 6500 samples randomly from each class.<br />
<br />
- '''Yahoo! Answers:''' a topic classification dataset, with 10 classes and each class contains 140000 training samples and 5000 testing samples.<br />
<br />
For the English language datasets, the initial word representations were created using the publicly available ''word2vec'' [https://code.google.com/p/word2vec/] from Google news. For the Chinese language dataset, ''jieba'' [https://github.com/fxsjy/jieba] was used to segment sentences, and then 50-dimensional word vectors were trained on Chinese ''wikipedia'' using ''word2vec''.<br />
<br />
A number of other models are run against the same data after preprocessing. Some of these models include:<br />
<br />
- '''Self-attentive LSTM:''' an LSTM model with multi-hop attention for sentence embedding.<br />
<br />
- '''RCNN:''' the RCNN's recurrent structure allows for increased depth of capture for contextual information. Less noise is introduced on account of the model's holistic structure (compared to local features).<br />
<br />
The following results are obtained:<br />
<br />
[[File:table of results.png|550px|center]]<br />
<br />
The bold results represent the best performing model for a given dataset. These results show that the CRNN model is the best model for 4 of the 6 datasets, with the Self-attentive LSTM beating the CRNN by 0.03 and 0.12 on the news categorization problems. Considering that the CRNN model has better performance than the Self-attentive LSTM on the other 4 datasets, this suggests that the CRNN model is a better performer overall in the conditions of this benchmark.<br />
<br />
It should be noted that including the neural tensor layer in the CRNN model leads to a significant performance boost compared to the CRNN models without it. The performance boost can be attributed to the fact that the neural tensor layer captures the surrounding contextual information for each word, and brings this information between the forward and backward RNN in a direct method. As seen in the table, this leads to a better classification accuracy across all datasets.<br />
<br />
Another important result was that the CRNN model filter size impacted performance only in the sentiment analysis datasets, as seen in the following table, where the results for the AG's news, Yelp, and Yahoo! Answer datasets are displayed:<br />
<br />
[[File:filter_effects.png|550px|center]]<br />
<br />
== CRNN Model Architecture ==<br />
<br />
The CRNN model is a combination of RNN and CNN. It uses a CNN to compute the importance of each word in the text and utilizes a neural tensor layer to fuse forward and backward hidden states of bi-directional RNN.<br />
<br />
The input of the network is a text, which is a sequence of words. The output of the network is the text representation that is subsequently used as input of a fully-connected layer to obtain the class prediction.<br />
<br />
'''RNN Pipeline:'''<br />
<br />
The goal of the RNN pipeline is to input each word in a text, and retrieve the contextual information surrounding the word and compute the contextual representation of the word itself. This is accomplished by the use of a bi-directional RNN, such that a Neural Tensor Layer (NTL) can combine the results of the RNN to obtain the final output. RNNs are well-suited to NLP tasks because of their ability to sequentially process data such as ordered text.<br />
<br />
A RNN is similar to a feed-forward neural network, but it relies on the use of hidden states. Hidden states are layers in the neural net that produce two outputs: <math> \hat{y}_{t} </math> and <math> h_t </math>. For a time step <math> t </math>, <math> h_t </math> is fed back into the layer to compute <math> \hat{y}_{t+1} </math> and <math> h_{t+1} </math>. <br />
<br />
The pipeline will actually use a variant of RNN called GRU, short for Gated Recurrent Units. This is done to address the vanishing gradient problem which causes the network to struggle to memorize words that came earlier in the sequence. Traditional RNNs are only able to remember the most recent words in a sequence, which may be problematic since words that came at the beginning of the sequence that is important to the classification problem may be forgotten. A GRU attempts to solve this by controlling the flow of information through the network using update and reset gates. <br />
<br />
Let <math>h_{t-1} \in \mathbb{R}^m, x_t \in \mathbb{R}^d </math> be the inputs, and let <math>\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h \in \mathbb{R}^{m \times d}, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h \in \mathbb{R}^{m \times m}</math> be trainable weight matrices. Then the following equations describe the update and reset gates:<br />
<br />
<br />
<math><br />
z_t = \sigma(\mathbf{W}_zx_t + \mathbf{U}_zh_{t-1}) \text{update gate} \\<br />
r_t = \sigma(\mathbf{W}_rx_t + \mathbf{U}_rh_{t-1}) \text{reset gate} \\<br />
\tilde{h}_t = \text{tanh}(\mathbf{W}_hx_t + r_t \circ \mathbf{U}_hh_{t-1}) \text{new memory} \\<br />
h_t = (1-z_t)\circ \tilde{h}_t + z_t\circ h_{t-1}<br />
</math><br />
<br />
<br />
Note that <math> \sigma, \text{tanh}, \circ </math> are all element-wise functions. The above equations do the following:<br />
<br />
<ol><br />
<li> <math>h_{t-1}</math> carries information from the previous iteration and <math>x_t</math> is the current input </li><br />
<li> the update gate <math>z_t</math> controls how much past information should be forwarded to the next hidden state </li><br />
<li> the rest gate <math>r_t</math> controls how much past information is forgotten or reset </li><br />
<li> new memory <math>\tilde{h}_t</math> contains the relevant past memory as instructed by <math>r_t</math> and current information from the input <math>x_t</math> </li><br />
<li> then <math>z_t</math> is used to control what is passed on from <math>h_{t-1}</math> and <math>(1-z_t)</math> controls the new memory that is passed on<br />
</ol><br />
<br />
We treat <math>h_0</math> and <math> h_{n+1} </math> as zero vectors in the method. Thus, each <math>h_t</math> can be computed as above to yield results for the bi-directional RNN. The resulting hidden states <math>\overrightarrow{h_t}</math> and <math>\overleftarrow{h_t}</math> contain contextual information around the <math> t</math>-th word in forward and backward directions respectively. Contrary to convention, instead of concatenating these two vectors, it is argued that the word's contextual representation is more precise when the context information from different directions is collected and fused using a neural tensor layer as it permits greater interactions among each element of hidden states. Using these two vectors as input to the neural tensor layer, <math>V^i </math>, we compute a new representation that aggregates meanings from the forward and backward hidden states more accurately as follows:<br />
<br />
<math> <br />
[\hat{h_t}]_i = tanh(\overrightarrow{h_t}V^i\overleftarrow{h_t} + b_i) <br />
</math><br />
<br />
Where <math>V^i \in \mathbb{R}^{m \times m} </math> is the learned tensor layer, and <math> b_i \in \mathbb{R} </math> is the bias.We repeat this <math> m </math> times with different <math>V^i </math> matrices and <math> b_i </math> vectors. Through the neural tensor layer, each element in <math> [\hat{h_t}]_i </math> can be viewed as a different type of intersection between the forward and backward hidden states. In the model, <math> [\hat{h_t}]_i </math> will have the same size as the forward and backward hidden states. We then concatenate the three hidden states vectors to form a new vector that summarizes the context information :<br />
<math><br />
\overleftrightarrow{h_t} = [\overrightarrow{h_t}^T,\overleftarrow{h_t}^T,\hat{h_t}]^T <br />
</math><br />
<br />
We calculate this vector for every word in the text and then stack them all into matrix <math> H </math> with shape <math>n</math>-by-<math>3m</math>.<br />
<br />
<math><br />
H = [\overleftrightarrow{h_1};...\overleftrightarrow{h_n}]<br />
</math><br />
<br />
This <math>H</math> matrix is then forwarded as the results from the Recurrent Neural Network.<br />
<br />
<br />
'''CNN Pipeline:'''<br />
<br />
The goal of the CNN pipeline is to learn the relative importance of words in an input sequence based on different aspects. The process of this CNN pipeline is summarized as the following steps:<br />
<br />
<ol><br />
<li> Given a sequence of words, each word is converted into a word vector using the word2vec algorithm which gives matrix X. <br />
</li><br />
<br />
<li> Word vectors are then convolved through the temporal dimension with filters of various sizes (ie. different K) with learnable weights to capture various numerical K-gram representations. These K-gram representations are stored in matrix C.<br />
</li><br />
<br />
<ul><br />
<li> The convolution makes this process capture local and position-invariant features. Local means the K words are contiguous. Position-invariant means K contiguous words at any position are detected in this case via convolution.<br />
<br />
<li> Temporal dimension example: convolve words from 1 to K, then convolve words 2 to K+1, etc<br />
</li><br />
</ul><br />
<br />
<li> Since not all K-gram representations are equally meaningful, there is a learnable matrix W which takes the linear combination of K-gram representations to more heavily weigh the more important K-gram representations for the classification task.<br />
</li><br />
<br />
<li> Each linear combination of the K-gram representations gives the relative word importance based on the aspect that the linear combination encodes.<br />
</li><br />
<br />
<li> The relative word importance vs aspect gives rise to an interpretable attention matrix A, where each element says the relative importance of a specific word for a specific aspect.<br />
</li><br />
<br />
</ol><br />
<br />
[[File:Group12_Figure1.png |center]]<br />
<br />
<div align="center">Figure 1: The architecture of CRNN.</div><br />
<br />
== Merging RNN & CNN Pipeline Outputs ==<br />
<br />
The results from both the RNN and CNN pipeline can be merged by simply multiplying the output matrices. That is, we compute <math>S=A^TH</math> which has shape <math>z \times 3m</math> and is essentially a linear combination of the hidden states. The concatenated rows of S results in a vector in <math>\mathbb{R}^{3zm}</math> and can be passed to a fully connected Softmax layer to output a vector of probabilities for our classification task. <br />
<br />
To train the model, we make the following decisions:<br />
<ul><br />
<li> Use cross-entropy loss as the loss function (A cross-entropy loss function usually takes in two distributions, a true distribution p and an estimated distribution q, and measures the average number of bits need to identify an event. This calculation is independent of the kind of layers used in the network as well as the kind of activation being implemented.) </li><br />
<li> Perform dropout on random columns in matrix C in the CNN pipeline </li><br />
<li> Perform L2 regularization on all parameters </li><br />
<li> Use stochastic gradient descent with a learning rate of 0.001 </li><br />
</ul><br />
<br />
== Interpreting Learned CRNN Weights ==<br />
<br />
Recall that attention matrix A essentially stores the relative importance of every word in the input sequence for every aspect chosen. Naturally, this means that A is an n-by-z matrix, with n being the number of words in the input sequence and z being the number of aspects considered in the classification task. <br />
<br />
Furthermore, for any specific aspect, words with higher attention values are more important relative to other words in the same input sequence. likewise, for any specific word, aspects with higher attention values prioritize the specific word more than other aspects.<br />
<br />
For example, in this paper, a sentence is sampled from the Movie Reviews dataset, and the transpose of attention matrix A is visualized. Each word represents an element in matrix A, the intensity of red represents the magnitude of an attention value in A, and each sentence is the relative importance of each word for a specific context. In the first row, the words are weighted in terms of a positive aspect, in the last row, the words are weighted in terms of a negative aspect, and in the middle row, the words are weighted in terms of a positive and negative aspect. Notice how the relative importance of words is a function of the aspect.<br />
<br />
[[File:Interpretation example.png|800px|center]]<br />
<br />
From the above sample, it is interesting that the word "but" is viewed as a negative aspect. From a linguistic perspective, the semantic of "but" is incredibly difficult to capture because of the degree of contextual information it needs. In this case, "but" is in the middle of a transition from a negative to a positive so the first row should also have given attention to that word. Also, it seems that the model has learned to give very high attention to the two words directly adjacent to the word of high attention: "is" and "and" beside "powerful", and "an" and "cast" beside "unwieldy".<br />
<br />
The paper also shows that the model can determine important words in the news. The authors take Ag's news dataset and randomly select 10 important words for each class. The table (shown below) contains eye-catching words which fit their classes well.<br />
<br />
[[File:news_words.png|480px|center]]<br />
<br />
== Conclusion & Summary ==<br />
<br />
This paper proposed a new architecture, the Convolutional Recurrent Neural Network, for text classification. The Convolutional Neural Network is used to learn the relative importance of each word from their different aspects and stores this information into a weight matrix. The Recurrent Neural Network learns each word's contextual representation through the combination of the forward and backward context information that is fused using a neural tensor layer and is stored as a matrix. These two matrices are then combined to get the text representation used for classification. Although the specifics of the performed tests are lacking, the experiment's results indicate that the proposed method performed well in comparison to most previous methods. In addition to performing well, the proposed method also provides insight into which words contribute greatly to the classification decision as to the learned matrix from the Convolutional Neural Network stores the relative importance of each word. This information can then be used in other applications or analyses. In the future, one can explore the features extracted from the model and use them to potentially learn new methods such as model space. [5]<br />
<br />
== Critiques ==<br />
<br />
1. In the '''Method''' section of the paper, some explanations used the same notation for multiple different elements of the model. This made the paper harder to follow and understand since they were referring to different elements by identical notation. Additionally, the decision to use sigmoid and hyperbolic tangent functions as nonlinearities for representation learning, is not supported with evidence that these are optimal.<br />
<br />
2. In '''Comparison of Methods''', the authors discuss the range of hyperparameter settings that they search through. While some of the hyperparameters have a large range of search values, three parameters are fixed without much explanation as to why for all experiments, size of the hidden state of GRU, number of layers, and dropout. These parameters have a lot to do with the complexity of the model and this paper could be improved by providing relevant reasoning behind these values, or by providing additional experimental results over different values of these parameters.<br />
<br />
3. In the '''Results''' section of the paper, they tried to show that the classification results from the CRNN model can be better interpreted than other models. In these explanations, the details were lacking and the authors did not adequately demonstrate how their model is better than others.<br />
<br />
4. Finally, in the '''Results''' section again, the paper compares the CRNN model to several models which they did not implement and reproduce results with. This can be seen in the chart of results above, where several models do not have entries in the table for all six datasets. Since the authors used a subset of the datasets, these other models which were not reproduced could have different accuracy scores if they had been tested on the same data as the CRNN model. This difference in training and testing data is not mentioned in the paper, and the conclusion that the CRNN model is better in all cases may not be valid.<br />
<br />
5. Considering the general methodology, the author in the paper chose to fuse CNN with Gated recurrent unit (GRU) with, which is only one version of RNN. However, it has been shown that LSTM generally performs better than GRU, and the author should discuss their choice of using GRU in CRNN instead of LSTM in more detail. Looking at the authors' experimental results, LSTM even outperforms CRNN (with GRU) in some cases, which further motivates the idea of adopting LSTM for RNN to combine with CNN. Whether combining LSTM with CNN will lead to a better performance will of course need further verification, but in principle author should at least address the issue. <br />
<br />
6. This is an interesting method, I would be curious to see if this can be combined or compared with Quasi-Recurrent Neural Networks (https://arxiv.org/abs/1611.01576). In my experience, QRNNs perform similarly to LSTMs while running significantly faster using convolutions with a special temporal pooling. This seems compatible with the neural tensor layer proposed in this paper, which may be combined to yield stronger performance with faster runtimes.<br />
<br />
7. It would be interesting to see how the attention matrix is being constructed and how attention values are being determined in each matrix. For instance, does every different subject have its own attention matrix? If so, how will the situation be handled when the same attention matrix is used in different settings?<br />
<br />
8. The paper shows the CRNN model not performing the best with Ag's news and 20newsgroups. It would be interesting to investigate this in detail and see the difference in the way the data is handled in the model compared to the best performing model(self-attentive LSTM in both datasets).<br />
<br />
9. From the Interpreting Learned CRNN Weights part, the samples are labeled as positive and negative, and their words all have opposite emotional polarities. It can be observed that regardless of whether the polarity of the example is positive or negative, the keyword can be extracted by this method, reflecting that it can capture multiple semantically meaningful components. At the same time it will be very interesting to see if this method is applicable to other specific categories.<br />
<br />
10. The authors of this paper provide 2 examples of what topic classification is, but do not provide any explicit examples of "polysemic words whose meanings are context-sensitive", one of their main critiques of current methods. This is an opportunity to promote the usefulness of their method and engage and inform the reader, simply by listing examples of these words.<br />
<br />
11. In the "Interpreting Learned CRNN Weights" parts, authors gave a figure but did not explain whether this is an example of positive case or negative case. I suggest the author to label the figure. Also, in the paper, we can see there are 2 figures and both of the figures have been labelled as positive or negative but there is only one figure without labelled in the summary.<br />
<br />
12. In the CRNN Results vs Benchmarks, it would be better to provide more details and comparison of other methods other than CRNN. It would be clear and more intuitive for the readers why the author will choose CRNN rather than the others.<br />
<br />
== Comments ==<br />
<br />
- Could this be applied to ancient languages such as hieroglyphs to decipher/better understand them?<br />
<br />
- Another application for CRNN might be classifying spoken language as well as any form of audio.<br />
<br />
-I think it will be better to show more results by using this method. Maybe it will be better to put the result part after the architecture part? Writing a motivation will be better since it will catch readers' "eyes". I think it will be interesting to ask: whether can we apply this to ancient Chinese poetry? Since there are lots of types of ancient Chinese poetry, doing a classification for them will be interesting.<br />
<br />
- In another [https://www.aclweb.org/anthology/W99-0908/ paper] written by Andrew McCallum and Kamal Nigam, they introduce a different method of text classification. Namely, instead of a combination of recurrent and convolutional neural networks, they instead utilized bootstrapping with keywords, Expectation-Maximization algorithm, and shrinkage.<br />
<br />
== References ==<br />
----<br />
<br />
[1] Grimes, Seth. “Unstructured Data and the 80 Percent Rule.” Breakthrough Analysis, 1 Aug. 2008, breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.<br />
<br />
[2] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modeling sentences,”<br />
arXiv preprint arXiv:1404.2188, 2014.<br />
<br />
[3] K. Cho, B. V. Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning<br />
phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint<br />
arXiv:1406.1078, 2014.<br />
<br />
[4] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Proceedings<br />
of AAAI, 2015, pp. 2267–2273.<br />
<br />
[5] H. Chen, P. Tio, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE<br />
Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 124–136, 2014.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Streaming_Bayesian_Inference_for_Crowdsourced_Classification&diff=48401Streaming Bayesian Inference for Crowdsourced Classification2020-11-30T07:11:42Z<p>A29mukhe: /* Empirical Analysis */</p>
<hr />
<div>Group 4 Paper Presentation Summary<br />
<br />
By Jonathan Chow, Nyle Dharani, Ildar Nasirov<br />
<br />
== Motivation ==<br />
Crowdsourcing can be a useful tool for data generation in classification projects by collecting annotations of large groups. Typically, this takes the form of online questions which many respondents will manually answer for payment. One example of this is Amazon's Mechanical Turk, a website where businesses (or "requesters") will remotely hire individuals(known as "Turkers" or "crowdsourcers") to perform human intelligence tasks (tasks that computers cannot do). In theory, it is effective in processing high volumes of small tasks that would be expensive to achieve in other methods.<br />
<br />
The primary limitation of this method to acquire data is that respondents can submit incorrect responses so that we couldn't ensure the data quality.<br />
<br />
Therefore, the success of crowd sourcing is limited by how well ground-truth can be determined. The primary method for doing so is probabilistic inference. However, current methods are computationally expensive, lack theoretical guarantees, or are limited to specific settings. In the meantime, there are some approaches to focus on how we sample the data from the crowd, rather than how we aggregate it to improve the accuracy of the system.<br />
<br />
== Dawid-Skene Model for Crowdsourcing ==<br />
The one-coin Dawid-Skene model is popular for contextualizing crowdsourcing problems. For task <math>i</math> in set <math>M</math>, let the ground-truth be the binary <math>y_i = {\pm 1}</math>. We get labels <math>X = {x_{ij}}</math> where <math>j \in N</math> is the index for that worker.<br />
<br />
We assume that we interact with workers in sequential fashion. At each time step <math>t</math>, a worker <math>j = a(t) </math> provides their label for an assigned task <math>i</math> and provides the label <math>x_{ij} = {\pm 1}</math>. We denote responses up to time <math>t</math> via superscript.<br />
<br />
We let <math>x_{ij} = 0</math> if worker <math>j</math> has not completed task <math>i</math>. We assume that <math>P(x_{ij} = y_i) = p_j</math>. This implies that each worker is independent and has equal probability of correctly labelling regardless of task. In crowdsourcing the data, we must determine how workers are assigned to tasks. We introduce two methods.<br />
<br />
Under uniform sampling, workers are allocated to tasks such that each task is completed by the same number of workers, rounded to the nearest integer, and no worker completes a task more than once. This policy is given by <center><math>\pi_{uni}(t) = {\rm argmin}_{i \notin M_{a(t)}^t}\{ | N_i^t | \}.</math></center><br />
<br />
Under uncertainty sampling, we assign more workers to tasks that are less certain. Assuming, we are able to estimate the posterior probability of ground-truth, we can allocate workers to the task with the lowest probability of falling into the predicted class. This policy is given by <center><math>\pi_{us}(t) = {\rm argmin}_{i \notin M_{a(t)}^t}\{ (max_{k \in \{\pm 1\}} ( P(y_i = k | X^t) ) \}.</math></center><br />
<br />
We then need to aggregate the data. The simple method of majority voting makes predictions for a given task based on the class the most workers have assigned it, <math>\hat{y}_i = \text{sign}\{\sum_{j \in N_i} x_{ij}\}</math>.<br />
<br />
== Streaming Bayesian Inference for Crowdsourced Classification (SBIC) ==<br />
The aim of the SBIC algorithm is to estimate the posterior probability, <math>P(y, p | X^t, \theta)</math> where <math>X^t</math> are the observed responses at time <math>t</math> and <math>\theta</math> is our prior. We can then generate predictions <math>\hat{y}^t</math> as the marginal probability over each <math>y_i</math> given <math>X^t</math>, and <math>\theta</math>.<br />
<br />
We factor <math>P(y, p | X^t, \theta) \approx \prod_{i \in M} \mu_i^t (y_i) \prod_{j \in N} \nu_j^t (p_j) </math> where <math>\mu_i^t</math> corresponds to each task and <math>\nu_j^t</math> to each worker.<br />
<br />
We then sequentially optimize the factors <math>\mu^t</math> and <math>\nu^t</math>. We begin by assuming that the worker accuracy follows a beta distribution with parameters <math>\alpha</math> and <math>\beta</math>. Initialize the task factors <math>\mu_i^0(+1) = q</math> and <math>\mu_i^0(-1) = 1 – q</math> for all <math>i</math>.<br />
<br />
When a new label is observed at time <math>t</math>, we update the <math>\nu_j^t</math> of worker <math>j</math>. We then update <math>\mu_i</math>. These updates are given by<br />
<br />
<center><math>\nu_j^t(p_j) \sim \text{Beta}(\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha, \sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(-x_{ij}) + \beta) </math></center><br />
<br />
<center><math>\mu_i^t(y_i) \propto \begin{cases} \mu_i^{t - 1}(y_i)\overline{p}_j^t & x_{ij} = y_i \\ \mu_i^{t - 1}(y_i)(1 - \overline{p}_j^t) & x_{ij} \ne y_i \end{cases}</math></center><br />
where <math>\hat{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta }</math>.<br />
<br />
We choose our predictions to be the maximum <math>\mu_i^t(k) </math> for <math>k= \{-1,1\}</math>.<br />
<br />
Depending on our ordering of labels <math>X</math>, we can select for different applications.<br />
<br />
== Fast SBIC ==<br />
The pseudocode for Fast SBIC is shown below.<br />
<br />
<center>[[Image:FastSBIC.png|800px|]]</center><br />
<br />
As the name implies, the goal of this algorithm is speed. To facilitate this, we leave the order of <math>X</math> unchanged.<br />
<br />
We express <math>\mu_i^t</math> in terms of its log-odds<br />
<center><math>z_i^t = \log(\frac{\mu_i^t(+1)}{ \mu_i^t(-1)}) = z_i^{t - 1} + x_{ij} \log(\frac{\overline{p}_j^t}{1 - \overline{p}_j^t })</math></center><br />
where <math>z_i^0 = \log(\frac{q}{1 - q})</math>.<br />
<br />
The product chain then becomes a summation and removes the need to normalize each <math>\mu_i^t</math>. We use these log-odds to compute worker accuracy,<br />
<br />
<center><math>\overline{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} \sigma(x_{ij} z_i^{t-1}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta}</math></center><br />
where <math>\sigma(z_i^{t-1}) := \frac{1}{1 + exp(-z_i^{t - 1})} = \mu_i^{t - 1}(+1) </math><br />
<br />
The final predictions are made by choosing class <math>\hat{y}_i^T = \text{sign}(z_i^T) </math>. We see later that Fast SBIC has similar computational speed to majority voting.<br />
<br />
== Sorted SBIC ==<br />
To increase the accuracy of the SBIC algorithm in exchange for computational efficiency, we run the algorithm in parallel giving labels in different orders. The pseudocode for this algorithm is given below.<br />
<br />
<center>[[Image:SortedSBIC.png|800px|]]</center><br />
<br />
From the general discussion of SBIC, we know that predictions on task <math>i</math> are more accurate toward the end of the collection process. This is a result of observing more data points and having run more updates on <math>\mu_i^t</math> and <math>\nu_j^t</math> to move them further from their prior. This means that task <math>i</math> is predicted more accurately when its corresponding labels are seen closer to the end of the process.<br />
<br />
We take advantage of this property by maintaining a distinct “view” of the log-odds for each task. When a label is observed, we update views for all tasks except the one for which the label was observed. At the end of the collection process, we process skipped labels. When run online, this process must be repeated at every timestep.<br />
<br />
We see that sorted SBIC is slower than Fast SBIC by a factor of M, the number of tasks. However, we can reduce the complexity by viewing <math>s^k</math> across different tasks in an offline setting when the whole dataset is known in advance.<br />
<br />
== Theoretical Analysis ==<br />
The authors prove an exponential relationship between the error probability and the number of labels per task. This allows for an upper bound to be placed on the error, in the asymptotic regime, where it can be assumed the SBIC predictions are close to the true values. The two theorems, for the different sampling regimes, are presented below.<br />
<br />
<center>[[Image:Theorem1.png|800px|]]</center><br />
<br />
<center>[[Image:Theorem2.png|800px|]]</center><br />
<br />
== Empirical Analysis ==<br />
The purpose of the empirical analysis is to compare SBIC to the existing state of the art algorithms. The SBIC algorithm is run on five real-world binary classification datasets. The results can be found in the table below. Other algorithms in the comparison are, from left to right, majority voting, expectation-maximization, mean-field, belief-propagation, Monte-Carlo sampling, and triangular estimation. <br />
<br />
First of all, the algorithms are run on a synthetic data that meets the assumptions of an underlying one-coin Dawid-Skene model, which allows the authors to compare SBIC's performance empirically with the theoretical results previously shown. <br />
<br />
Second, the algorithms are run on real-world data. The performance of each algorithm is shown in the table below.<br />
<br />
<center>[[Image:RealWorldResults.png|800px|]]</center><br />
<br />
In bold are the best performing algorithms for each dataset. We see that both versions of the SBIC algorithm are competitive, having similar prediction errors to EM, AMF, and MC. All are considered state-of-the-art Bayesian algorithms.<br />
<br />
The figure below shows the average time required to simulate predictions on synthetic data under an uncertainty sampling policy. We see that Fast SBIC is comparable to majority voting and significantly faster than the other algorithms. This speed improvement, coupled with comparable accuracy, makes the Fast SBIC algorithm powerful.<br />
<br />
<center>[[Image:TimeRequirement.png|800px|]]</center><br />
<br />
== Conclusion and Future Research ==<br />
In conclusion, we have seen that SBIC is computationally efficient, accurate in practice, and has theoretical guarantees. The authors intend to extend the algorithm to the multi-class case in the future.<br />
<br />
== Critique ==<br />
In crowdsourcing data, the cost associated with collecting additional labels is not usually prohibitively expensive. As a result, if there is concern over ground-truth, paying for additional data to ensure X is sufficiently dense may be the desired response as opposed to sacrificing ground-truth accuracy. This may result in the SBIC algorithm being less practically useful than intended. Perhaps, a study can be done to compare the cost of acquiring additional data and checking how much it improves the accuracy of ground-truth. This can help us decide the usefulness of this algorithm. Second, SBIC should be used in multiple types of data like audio, text, and images to make sure that it delivers consistent results.<br />
<br />
The paper is tackling the classic problem of aggregating labels in a crowdsourced application, with a focus on speed. The algorithms proposed are fast and simple to implement and come with theoretical guarantees on the bounds for error rates. However, the paper starts with an objective of designing fast label aggregation algorithms for a streaming setting yet doesn’t spend any time motivating the applications in which such algorithms are needed. All the datasets used in the empirical analysis are static datasets therefore for the paper to be useful, the problem considered should be well motivated. It also appears that the output from the algorithm depends on the order in which the data is processed, which may need to be clarified. Finally, the theoretical results are presented under the assumption that the predictions of the FBI converge to the ground truth, however, the reasoning behind this assumption is not explained.<br />
<br />
The paper assumes that crowdsourcing from human beings is systematic: that is, respondents to the problems would act in similar ways that can be classified into some categories. There are lots of other factors that need to be considered for a human respondent, such as fatigue effects and conflicts of interest. Those factors would seriously jeopardize the validity of the results and the model if they were not carefully designed and taken care of. For example, one formally accurate subject reacts badly to the subject one day generating lots of faulty data, and it would take lots of correct votes to even out the effects. Even in lots of medical experiment that involves human subjects, with rigorous standards and procedures, the results could still be invalid. The trade-off for speed while sacrificing the validity of results is not wise.<br />
<br />
When introducing the Streaming Bayesian Inference for Crowdsourced Classification method, it was explicitly mentioned that one can select for different applications based on the ordering of labels X. However, "applications" mentioned here were not further explained or explored to support the effectiveness and meaning of developing such an algorithm. Thus, it would be sufficient for the author to build a connection between the proposed algorithm and its real-world application to make this summary more purposeful and engaging.<br />
<br />
In the Dawid-Skene Model for Crowdsourcing part, the summary just indicated the simplest part of the data aggregation which is unusual in the real world. Instead, we can use Bayesian methods which infer the value of the latent variables y and p by estimating their posterior probability P(y,p|X,θ) given the observed data X and prior θ.<br />
<br />
== References ==<br />
[1] Manino, Tran-Thanh, and Jennings. Streaming Bayesian Inference for Crowdsourced Classification. 33rd Conference on Neural Information Processing Systems, 2019</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Influenza_Forecasting_Framework_based_on_Gaussian_Processes&diff=48380Influenza Forecasting Framework based on Gaussian Processes2020-11-30T06:44:44Z<p>A29mukhe: /* References */</p>
<hr />
<div><br />
== Abstract ==<br />
<br />
This paper presents a novel framework for seasonal epidemic forecasting using Gaussian process regression. Compared with other state-of-the-art models, the new novel framework introduced in this paper shows accurate retrospective forecasts through training a subset of the CDC influenza-like-illness (ILI) dataset using the official CDC scoring rule (log-score).<br />
<br />
== Background ==<br />
<br />
Each year, the seasonal influenza epidemic affects public health systems across the globe on a massive scale. In 2019-2020 alone, the United States reported 38 million cases, 400 000 hospitalizations, and 22 000 deaths [1]. A direct result of this is the need for reliable forecasts of future influenza development because they allow for improved public health policies and informed resource development and allocation. Many statistical methods have been developed to use data from the CDC and other real-time data sources, such as Google Trends to forecast influenza activities.<br />
<br />
Given the process of data collection and surveillance lag, accurate statistics for influenza warning systems are often delayed by some margin of time, making early prediction imperative. However, there are challenges in long-term epidemic forecasting. First, the temporal dependency is hard to capture with short-term input data. Without manually added seasonal trends, most statistical models fail to provide high accuracy. Second, the influence from other locations has not been exhaustively explored with limited data input. Spatio-temporal effects would therefore require adequate data sources to achieve good performance.<br />
<br />
== Related Work ==<br />
<br />
Given the value of epidemic forecasts, the CDC regularly publishes ILI data and has funded a seasonal ILI forecasting challenge. This challenge has led to four state-of-the-art models in the field: Historical averages, a method using the counts of past seasons to build a kernel density estimate; MSS, a physical susceptible-infected-recovered model with assumed linear noise [5]; SARIMA, a framework based on seasonal auto-regressive moving average models [2]; and LinEns, an ensemble of three linear regression models.<br />
<br />
One disadvantage of forecasting is that the publication of CDC official ILI data is usually delayed by 1-3 weeks. Therefore, several attempts to use indirect web-based data to gain more timely estimates of CDC’s ILI, which is also known as nowcasting. The forecasting framework introduced in this paper can use those nowcasting approaches to receive a more timely data stream.<br />
<br />
== Motivation ==<br />
<br />
It has been shown that LinEns forecasts outperform the other frameworks on the ILI data-set. However, this framework assumes a deterministic relationship between the epidemic week and its case count, which does not reflect the stochastic nature of the trend. Therefore, it is natural to ask whether a similar framework that assumes a stochastic relationship between these variables would provide better performance. This motivated the development of the proposed Gaussian process regression framework and the subsequent performance comparison to the benchmark models.<br />
<br />
== Gaussian Process Regression ==<br />
<br />
A Gaussian process is specified with a mean function and a kernel function. Besides, the mean and the variance function of the Gaussian process is defined by:<br />
\[\mu(x) = k(x, X)^T(K_n+\sigma^2I)^{-1}Y\]<br />
\[\sigma(x) = k^{**}(x,x)-k(x,X)^T(K_n+\sigma^2I)^{-1}k(x,X)\]<br />
where <math>K_n</math> is the covariance matrix and for the kernel, it is being specified that it will use the Gaussian kernel.<br />
<br />
Consider the following set up: let <math>X = [\mathbf{x}_1,\ldots,\mathbf{x}_n]</math> <math>(d\times n)</math> be your training data, <math>\mathbf{y} = [y_1,y_2,\ldots,y_n]^T</math> be your noisy observations where <math>y_i = f(\mathbf{x}_i) + \epsilon_i</math>, <math>(\epsilon_i:i = 1,\ldots,n)</math> i.i.d. <math>\sim \mathcal{N}(0,{\sigma}^2)</math>, and <math>f</math> is the trend we are trying to model (by <math>\hat{f}</math>). Let <math>\mathbf{x}^*</math> <math>(d\times 1)</math> be your test data point, and <math>\hat{y} = \hat{f}(\mathbf{x}^*)</math> be your predicted outcome.<br />
<br />
<br />
Instead of assuming a deterministic form of <math>f</math>, and thus of <math>\mathbf{y}</math> and <math>\hat{y}</math> (as classical linear regression would, for example), Gaussian process regression assumes <math>f</math> is stochastic. More precisely, <math>\mathbf{y}</math> and <math>\hat{y}</math> are assumed to have a joint prior distribution. Indeed, we have <br />
<br />
$$<br />
(\mathbf{y},\hat{y}) \sim \mathcal{N}(0,\Sigma(X,\mathbf{x}^*))<br />
$$<br />
<br />
where <math>\Sigma(X,\mathbf{x}^*)</math> is a matrix of covariances dependent on some kernel function <math>k</math>. In this paper, the kernel function is assumed to be Gaussian and takes the form <br />
<br />
$$<br />
k(\mathbf{x}_i,\mathbf{x}_j) = \sigma^2\exp(-\frac{1}{2}(\mathbf{x}_i-\mathbf{x}_j)^T\Sigma(\mathbf{x}_i-\mathbf{x}_j)).<br />
$$<br />
<br />
It is important to note that this gives a joint prior distribution of '''functions''' ('''Fig. 1''' left, grey curves). <br />
<br />
By restricting this distribution to contain only those functions ('''Fig. 1''' right, grey curves) that agree with the observed data points <math>\mathbf{x}</math> ('''Fig. 1''' right, solid black) we obtain the posterior distribution for <math>\hat{y}</math> which has the form<br />
<br />
$$<br />
p(\hat{y} | \mathbf{x}^*, X, \mathbf{y}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}),\sigma(\mathbf{x}^*,X))<br />
$$<br />
<br />
<br />
<div style="text-align:center;"> [[File:GPRegression.png|500px]] </div><br />
<br />
<div align="center">'''Figure 1. Gaussian process regression''': <br />
<div align="center"><br />
Select the functions from your joint prior distribution (left, grey curves) with mean <math>0</math> (left, bold line) that agree with the observed data points (right, black bullets). </div><br />
<div align="center">These form your posterior distribution (right, grey curves) with mean <math>\mu(\mathbf{x})</math> (right, bold line). Red triangle helps compare the two images (location marker) [4]. </div><br />
</div><br />
<br />
== Data-set ==<br />
<br />
Let <math>d_j^i</math> denote the number of epidemic cases recorded in week <math>j</math> of season <math>i</math>, and let <math>j^*</math> and <math>i^*</math> denote the current week and season, respectively. The ILI data-set contains <math>d_j^i</math> for all previous weeks and seasons, up to the current season with a 1-3 week publishing delay. Note that a season refers to the time of year when the epidemic is prevalent (e.g. an influenza season lasts 30 weeks and contains the last 10 weeks of year k, and the first 20 weeks of year k+1). The goal is to predict <math>\hat{y}_T = \hat{f}_T(x^*) = d^{i^*}_{j* + T}</math> where <math>T, \;(T = 1,\ldots,K)</math> is the target week (how many weeks into the future that you want to predict).<br />
<br />
To do this, a design matrix <math>X</math> is constructed where each element <math>X_{ji} = d_j^i</math> corresponds to the number of cases in week (row) j of season (column) i. The training outcomes <math>y_{i,T}, i = 1,\ldots,n</math> correspond to the number of cases that were observed in target week <math>T,\; (T = 1,\ldots,K)</math> of season <math>i, (i = 1,\ldots,n)</math>.<br />
<br />
== Proposed Framework ==<br />
<br />
To compute <math>\hat{y}</math>, the following algorithm is executed: <br />
<br />
<ol><br />
<br />
<li> Let <math>J \subseteq \{j^*-4 \leq j \leq j^*\}</math> (subset of possible weeks).<br />
<br />
<li> Assemble the Training Set <math>\{X_J, \mathbf{y}_{T,J}\}</math> .<br />
<br />
<li> Train the Gaussian process.<br />
<br />
<li> Calculate the '''distribution''' of <math>\hat{y}_{T,J}</math> using <math>p(\hat{y}_{T,J} | \mathbf{x}^*, X_J, \mathbf{y}_{T,J}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}_{T,J}),\sigma(\mathbf{x}^*,X_J))</math>.<br />
<br />
<li> Set <math>\hat{y}_{T,J} =\mu(x^*,X_J,\mathbf{y}_{T,J})</math>.<br />
<br />
<li> Repeat steps 2-5 for all sets of weeks <math>J</math>.<br />
<br />
<li> Determine the best 3 performing sets J (on the 2010/11 and 2011/12 validation sets).<br />
<br />
<li> Calculate the ensemble forecast by averaging the 3 best performing predictive distribution densities i.e. <math>\hat{y}_T = \frac{1}{3}\sum_{k=1}^3 \hat{y}_{T,J_{best}}</math>.<br />
<br />
</ol><br />
<br />
== Results ==<br />
<br />
To demonstrate the accuracy of their results, retrospective forecasting was done on the ILI data-set. Retrospective forecasting involves using information from a specific time period in the past for example a certain week of the previous year and tries to forecast future values, for example, the next week's cases. Then, the forecasting accuracy can be evaluated on the actual observed value. In other words, the Gaussian process model was trained to assume a previous season (2012/13) was the current season. In this fashion, the forecast could be compared to the already observed true outcome. <br />
<br />
To produce a forecast for the entire 2012/13 season, 30 Gaussian processes were trained (each influenza season has 30 test points <math>\mathbf{x^*}</math>) and a curve connecting the predicted outputs <math>y_T = \hat{f}(\mathbf{x^*)}</math> was plotted ('''Fig.2''', blue line). As shown in '''Fig.2''', this forecast (blue line) was reliable for both 1 (left) and 3 (right) week targets, given that the 95% prediction interval ('''Fig.2''', purple shaded) contained the true values ('''Fig.2''', red x's) 95% of the time.<br />
<br />
<div style="text-align:center;"> [[File:ResultsOne.png|600px]] </div><br />
<br />
<div align="center">'''Figure 2. Retrospective forecasts and their uncertainty''': <br />
<div align="center">One week retrospective influenza forecasting for two targets (T = 1, 3). Red x’s are the true observed values, and blue lines and purple shaded areas represent point forecasts and 95% prediction intervals, respectively. </div><br />
</div><br />
<br />
Moreover, as shown in '''Fig.3''', the novel Gaussian process regression framework outperformed all state-of-the-art models, included LinEns, for four different targets <math>(T = 1,\ldots, 4)</math>, when compared using the official CDC scoring criterion ''log-score''. Log-score describes the logarithmic probability of the forecast being within an interval around the true value. <br />
<br />
<div style="text-align:center;"> [[File:ComparisonNew.png|600px]] </div><br />
<br />
<div align="center">'''Figure 3. Average log-score gain of proposed framework''': <br />
<div align="center"> Each bar shows the mean seasonal log-score gain of the proposed framework vs. the given state-of-the-art model, and each panel corresponds to a different target week <math> T = 1,...4 </math>. </div><br />
</div><br />
<br />
== Conclusion ==<br />
<br />
This paper presented a novel framework for forecasting seasonal epidemics using Gaussian process regression that outperformed multiple state-of-the-art forecasting methods on the CDC's ILI data-set. Since influenza data has only been collected from 2003, therefore as more data is obtained: the training models can be further improved upon with more robust model optimizations. Hence, this work may play a key role in future influenza forecasting and as result, the improvement of public health policies and resource allocation.<br />
<br />
== Critique ==<br />
<br />
The proposed framework provides a computationally efficient method to forecast any seasonal epidemic count data that is easily extendable to multiple target types. In particular, one can compute key parameters such as the peak infection incidence <math>\left(\hat{y} = \text{max}_{0 \leq j \leq 52} d^i_j \right)</math>, the timing of the peak infection incidence <math>\left(\hat{y} = \text{argmax}_{0 \leq j \leq 52} d^i_j\right)</math> and the final epidemic size of a season <math>\left(\hat{y} = \sum_{j=1}^{52} d^i_j\right)</math>. However, given it is not a physical model, it cannot provide insights on parameters describing the disease spread. Moreover, the framework requires training data and hence, is not applicable for non-seasonal epidemics.<br />
<br />
Consider the Gaussian Process Regression, we can apply it on the COVID-19. Hence, we can add the data obtained from COVID-19 to the data-set part. Then apply the algorithm that was outlined from the project. We can predict the high wave period to better protect ourselves. But the thing is how can we obtain the training data from COVID-19 since there will be many missing values since many people may not know whether they have been affected or not. Thus, it will be a big problem to obtain the best training data.<br />
<br />
There is one Gaussian property which makes most of its models extremely tractable. A linear combination of two or more Gaussian distributions is still a Gaussian distribution. It's very easy to determine the mean and variance of the newly constructed Gaussian distribution. But there's a constraint of applying this property to several Gaussian distributions. All Gaussian distributions should be linearly independent of others, otherwise, the linear combination distribution might not be Gaussian anymore.<br />
<br />
This paper provides a state of the art approach for forecasting epidemics. It would have been interesting to see other types of kernels being used, such as a periodic kernel <math>k(x, x') = \sigma^2 \exp\left({-\frac{2 \sin^2 (\pi|x-x'|)/p}{l^2}}\right) </math>, as intuitively epidemics are known to have waves within seasons. This may have resulted in better-calibrated uncertainty estimates as well.<br />
<br />
It is mentioned that the the framework might not be good for non-seasonal epidemics because it requires training data, given that the COVID-19 pandemic comes in multiple waves and we have enough data from the first wave and second wave, we might be able to use this framework to predict the third wave and possibly the fourth one as well. It'd be interesting to see this forecasting framework being trained using the data from the first and second wave of COVID-19.<br />
<br />
Gaussian Process Regression (GPR) can be extended to COVID-19 outbreak forecast [3]. A Multi-Task Gaussian Process Regression (MTGP) model is a special case of GPR that yields multiple outputs. It was first proposed back in 2008, and its recent application was in the battery capacity prediction by Richardson et. al. The proposed MTGP model was applied on Worldwide, China, India, Italy, and USA data that covers COVID-19 statistics from Dec 31 2019 to June 25 2020. Results show that MTGP outperforms linear regression, support vector regression, random forest regression, and long short-term memory model when it comes to forecasting accuracy.<br />
<br />
A critique for GPR is that it does not take into account the physicality of the epidemic process. So it would be interesting to investigate if we can incorporate elements of physical processes in this such as using Ordinary Differential Equations (ODE) based modeling and seeing if that improves prediction accuracy or helps us explain phenomena.<br />
<br />
It will be interesting to provide a comparison of methods described in this paper with optimal control theory for modelling infectious disease.<br />
<br />
This is a very nice summary and everything was discussed clearly. I think it will be interesting that apply GPR by using the data in COVID-19, maybe this will be helpful to give a sense that how likely that 3rd wave of COVID-19 will happen. Maybe it will be better to discuss more that a training data set from a location will not train a good model for other countries. (p.s. I think <math> \sigma </math> is the standard deviation, sometimes saying variance function with the notation <math> \sigma </math> confused me.).<br />
<br />
== References ==<br />
<br />
[1] Estimated Influenza Illnesses, Medical visits, Hospitalizations, and Deaths in the United States - 2019–2020 Influenza Season. (2020). Retrieved November 16, 2020, from https://www.cdc.gov/flu/about/burden/2019-2020.html<br />
<br />
[2] Ray, E. L., Sakrejda, K., Lauer, S. A., Johansson, M. A.,and Reich, N. G. (2017).Infectious disease prediction with kernel conditional densityestimation.Statistics in Medicine, 36(30):4908–4929.<br />
<br />
[3] Ketu, S., Mishra, P.K. Enhanced Gaussian process regression-based forecasting model for COVID-19 outbreak and significance of IoT for its detection. Appl Intell (2020). https://doi.org/10.1007/s10489-020-01889-9<br />
<br />
[4] Schulz, E., Speekenbrink, M., and Krause, A. (2017).A tutorial on gaussian process regression with a focus onexploration-exploitation scenarios.bioRxiv.<br />
<br />
[5] Zimmer, C., Leuba, S. I., Cohen, T., and Yaesoubi, R.(2019).Accurate quantification of uncertainty in epidemicparameter estimates and predictions using stochasticcompartmental models.Statistical Methods in Medical Research,28(12):3591–3608.PMID: 30428780.<br />
<br />
[6] Bussell E. H., Dangerfield C. E., Gilligan C. A. and Cunniffe N. J. 2019Applying optimal control theory to complex epidemiological models to inform real-world disease managementPhil. Trans. R. Soc. B37420180284 http://doi.org/10.1098/rstb.2018.0284</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Influenza_Forecasting_Framework_based_on_Gaussian_Processes&diff=48379Influenza Forecasting Framework based on Gaussian Processes2020-11-30T06:43:38Z<p>A29mukhe: /* Critique */</p>
<hr />
<div><br />
== Abstract ==<br />
<br />
This paper presents a novel framework for seasonal epidemic forecasting using Gaussian process regression. Compared with other state-of-the-art models, the new novel framework introduced in this paper shows accurate retrospective forecasts through training a subset of the CDC influenza-like-illness (ILI) dataset using the official CDC scoring rule (log-score).<br />
<br />
== Background ==<br />
<br />
Each year, the seasonal influenza epidemic affects public health systems across the globe on a massive scale. In 2019-2020 alone, the United States reported 38 million cases, 400 000 hospitalizations, and 22 000 deaths [1]. A direct result of this is the need for reliable forecasts of future influenza development because they allow for improved public health policies and informed resource development and allocation. Many statistical methods have been developed to use data from the CDC and other real-time data sources, such as Google Trends to forecast influenza activities.<br />
<br />
Given the process of data collection and surveillance lag, accurate statistics for influenza warning systems are often delayed by some margin of time, making early prediction imperative. However, there are challenges in long-term epidemic forecasting. First, the temporal dependency is hard to capture with short-term input data. Without manually added seasonal trends, most statistical models fail to provide high accuracy. Second, the influence from other locations has not been exhaustively explored with limited data input. Spatio-temporal effects would therefore require adequate data sources to achieve good performance.<br />
<br />
== Related Work ==<br />
<br />
Given the value of epidemic forecasts, the CDC regularly publishes ILI data and has funded a seasonal ILI forecasting challenge. This challenge has led to four state-of-the-art models in the field: Historical averages, a method using the counts of past seasons to build a kernel density estimate; MSS, a physical susceptible-infected-recovered model with assumed linear noise [5]; SARIMA, a framework based on seasonal auto-regressive moving average models [2]; and LinEns, an ensemble of three linear regression models.<br />
<br />
One disadvantage of forecasting is that the publication of CDC official ILI data is usually delayed by 1-3 weeks. Therefore, several attempts to use indirect web-based data to gain more timely estimates of CDC’s ILI, which is also known as nowcasting. The forecasting framework introduced in this paper can use those nowcasting approaches to receive a more timely data stream.<br />
<br />
== Motivation ==<br />
<br />
It has been shown that LinEns forecasts outperform the other frameworks on the ILI data-set. However, this framework assumes a deterministic relationship between the epidemic week and its case count, which does not reflect the stochastic nature of the trend. Therefore, it is natural to ask whether a similar framework that assumes a stochastic relationship between these variables would provide better performance. This motivated the development of the proposed Gaussian process regression framework and the subsequent performance comparison to the benchmark models.<br />
<br />
== Gaussian Process Regression ==<br />
<br />
A Gaussian process is specified with a mean function and a kernel function. Besides, the mean and the variance function of the Gaussian process is defined by:<br />
\[\mu(x) = k(x, X)^T(K_n+\sigma^2I)^{-1}Y\]<br />
\[\sigma(x) = k^{**}(x,x)-k(x,X)^T(K_n+\sigma^2I)^{-1}k(x,X)\]<br />
where <math>K_n</math> is the covariance matrix and for the kernel, it is being specified that it will use the Gaussian kernel.<br />
<br />
Consider the following set up: let <math>X = [\mathbf{x}_1,\ldots,\mathbf{x}_n]</math> <math>(d\times n)</math> be your training data, <math>\mathbf{y} = [y_1,y_2,\ldots,y_n]^T</math> be your noisy observations where <math>y_i = f(\mathbf{x}_i) + \epsilon_i</math>, <math>(\epsilon_i:i = 1,\ldots,n)</math> i.i.d. <math>\sim \mathcal{N}(0,{\sigma}^2)</math>, and <math>f</math> is the trend we are trying to model (by <math>\hat{f}</math>). Let <math>\mathbf{x}^*</math> <math>(d\times 1)</math> be your test data point, and <math>\hat{y} = \hat{f}(\mathbf{x}^*)</math> be your predicted outcome.<br />
<br />
<br />
Instead of assuming a deterministic form of <math>f</math>, and thus of <math>\mathbf{y}</math> and <math>\hat{y}</math> (as classical linear regression would, for example), Gaussian process regression assumes <math>f</math> is stochastic. More precisely, <math>\mathbf{y}</math> and <math>\hat{y}</math> are assumed to have a joint prior distribution. Indeed, we have <br />
<br />
$$<br />
(\mathbf{y},\hat{y}) \sim \mathcal{N}(0,\Sigma(X,\mathbf{x}^*))<br />
$$<br />
<br />
where <math>\Sigma(X,\mathbf{x}^*)</math> is a matrix of covariances dependent on some kernel function <math>k</math>. In this paper, the kernel function is assumed to be Gaussian and takes the form <br />
<br />
$$<br />
k(\mathbf{x}_i,\mathbf{x}_j) = \sigma^2\exp(-\frac{1}{2}(\mathbf{x}_i-\mathbf{x}_j)^T\Sigma(\mathbf{x}_i-\mathbf{x}_j)).<br />
$$<br />
<br />
It is important to note that this gives a joint prior distribution of '''functions''' ('''Fig. 1''' left, grey curves). <br />
<br />
By restricting this distribution to contain only those functions ('''Fig. 1''' right, grey curves) that agree with the observed data points <math>\mathbf{x}</math> ('''Fig. 1''' right, solid black) we obtain the posterior distribution for <math>\hat{y}</math> which has the form<br />
<br />
$$<br />
p(\hat{y} | \mathbf{x}^*, X, \mathbf{y}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}),\sigma(\mathbf{x}^*,X))<br />
$$<br />
<br />
<br />
<div style="text-align:center;"> [[File:GPRegression.png|500px]] </div><br />
<br />
<div align="center">'''Figure 1. Gaussian process regression''': <br />
<div align="center"><br />
Select the functions from your joint prior distribution (left, grey curves) with mean <math>0</math> (left, bold line) that agree with the observed data points (right, black bullets). </div><br />
<div align="center">These form your posterior distribution (right, grey curves) with mean <math>\mu(\mathbf{x})</math> (right, bold line). Red triangle helps compare the two images (location marker) [4]. </div><br />
</div><br />
<br />
== Data-set ==<br />
<br />
Let <math>d_j^i</math> denote the number of epidemic cases recorded in week <math>j</math> of season <math>i</math>, and let <math>j^*</math> and <math>i^*</math> denote the current week and season, respectively. The ILI data-set contains <math>d_j^i</math> for all previous weeks and seasons, up to the current season with a 1-3 week publishing delay. Note that a season refers to the time of year when the epidemic is prevalent (e.g. an influenza season lasts 30 weeks and contains the last 10 weeks of year k, and the first 20 weeks of year k+1). The goal is to predict <math>\hat{y}_T = \hat{f}_T(x^*) = d^{i^*}_{j* + T}</math> where <math>T, \;(T = 1,\ldots,K)</math> is the target week (how many weeks into the future that you want to predict).<br />
<br />
To do this, a design matrix <math>X</math> is constructed where each element <math>X_{ji} = d_j^i</math> corresponds to the number of cases in week (row) j of season (column) i. The training outcomes <math>y_{i,T}, i = 1,\ldots,n</math> correspond to the number of cases that were observed in target week <math>T,\; (T = 1,\ldots,K)</math> of season <math>i, (i = 1,\ldots,n)</math>.<br />
<br />
== Proposed Framework ==<br />
<br />
To compute <math>\hat{y}</math>, the following algorithm is executed: <br />
<br />
<ol><br />
<br />
<li> Let <math>J \subseteq \{j^*-4 \leq j \leq j^*\}</math> (subset of possible weeks).<br />
<br />
<li> Assemble the Training Set <math>\{X_J, \mathbf{y}_{T,J}\}</math> .<br />
<br />
<li> Train the Gaussian process.<br />
<br />
<li> Calculate the '''distribution''' of <math>\hat{y}_{T,J}</math> using <math>p(\hat{y}_{T,J} | \mathbf{x}^*, X_J, \mathbf{y}_{T,J}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}_{T,J}),\sigma(\mathbf{x}^*,X_J))</math>.<br />
<br />
<li> Set <math>\hat{y}_{T,J} =\mu(x^*,X_J,\mathbf{y}_{T,J})</math>.<br />
<br />
<li> Repeat steps 2-5 for all sets of weeks <math>J</math>.<br />
<br />
<li> Determine the best 3 performing sets J (on the 2010/11 and 2011/12 validation sets).<br />
<br />
<li> Calculate the ensemble forecast by averaging the 3 best performing predictive distribution densities i.e. <math>\hat{y}_T = \frac{1}{3}\sum_{k=1}^3 \hat{y}_{T,J_{best}}</math>.<br />
<br />
</ol><br />
<br />
== Results ==<br />
<br />
To demonstrate the accuracy of their results, retrospective forecasting was done on the ILI data-set. Retrospective forecasting involves using information from a specific time period in the past for example a certain week of the previous year and tries to forecast future values, for example, the next week's cases. Then, the forecasting accuracy can be evaluated on the actual observed value. In other words, the Gaussian process model was trained to assume a previous season (2012/13) was the current season. In this fashion, the forecast could be compared to the already observed true outcome. <br />
<br />
To produce a forecast for the entire 2012/13 season, 30 Gaussian processes were trained (each influenza season has 30 test points <math>\mathbf{x^*}</math>) and a curve connecting the predicted outputs <math>y_T = \hat{f}(\mathbf{x^*)}</math> was plotted ('''Fig.2''', blue line). As shown in '''Fig.2''', this forecast (blue line) was reliable for both 1 (left) and 3 (right) week targets, given that the 95% prediction interval ('''Fig.2''', purple shaded) contained the true values ('''Fig.2''', red x's) 95% of the time.<br />
<br />
<div style="text-align:center;"> [[File:ResultsOne.png|600px]] </div><br />
<br />
<div align="center">'''Figure 2. Retrospective forecasts and their uncertainty''': <br />
<div align="center">One week retrospective influenza forecasting for two targets (T = 1, 3). Red x’s are the true observed values, and blue lines and purple shaded areas represent point forecasts and 95% prediction intervals, respectively. </div><br />
</div><br />
<br />
Moreover, as shown in '''Fig.3''', the novel Gaussian process regression framework outperformed all state-of-the-art models, included LinEns, for four different targets <math>(T = 1,\ldots, 4)</math>, when compared using the official CDC scoring criterion ''log-score''. Log-score describes the logarithmic probability of the forecast being within an interval around the true value. <br />
<br />
<div style="text-align:center;"> [[File:ComparisonNew.png|600px]] </div><br />
<br />
<div align="center">'''Figure 3. Average log-score gain of proposed framework''': <br />
<div align="center"> Each bar shows the mean seasonal log-score gain of the proposed framework vs. the given state-of-the-art model, and each panel corresponds to a different target week <math> T = 1,...4 </math>. </div><br />
</div><br />
<br />
== Conclusion ==<br />
<br />
This paper presented a novel framework for forecasting seasonal epidemics using Gaussian process regression that outperformed multiple state-of-the-art forecasting methods on the CDC's ILI data-set. Since influenza data has only been collected from 2003, therefore as more data is obtained: the training models can be further improved upon with more robust model optimizations. Hence, this work may play a key role in future influenza forecasting and as result, the improvement of public health policies and resource allocation.<br />
<br />
== Critique ==<br />
<br />
The proposed framework provides a computationally efficient method to forecast any seasonal epidemic count data that is easily extendable to multiple target types. In particular, one can compute key parameters such as the peak infection incidence <math>\left(\hat{y} = \text{max}_{0 \leq j \leq 52} d^i_j \right)</math>, the timing of the peak infection incidence <math>\left(\hat{y} = \text{argmax}_{0 \leq j \leq 52} d^i_j\right)</math> and the final epidemic size of a season <math>\left(\hat{y} = \sum_{j=1}^{52} d^i_j\right)</math>. However, given it is not a physical model, it cannot provide insights on parameters describing the disease spread. Moreover, the framework requires training data and hence, is not applicable for non-seasonal epidemics.<br />
<br />
Consider the Gaussian Process Regression, we can apply it on the COVID-19. Hence, we can add the data obtained from COVID-19 to the data-set part. Then apply the algorithm that was outlined from the project. We can predict the high wave period to better protect ourselves. But the thing is how can we obtain the training data from COVID-19 since there will be many missing values since many people may not know whether they have been affected or not. Thus, it will be a big problem to obtain the best training data.<br />
<br />
There is one Gaussian property which makes most of its models extremely tractable. A linear combination of two or more Gaussian distributions is still a Gaussian distribution. It's very easy to determine the mean and variance of the newly constructed Gaussian distribution. But there's a constraint of applying this property to several Gaussian distributions. All Gaussian distributions should be linearly independent of others, otherwise, the linear combination distribution might not be Gaussian anymore.<br />
<br />
This paper provides a state of the art approach for forecasting epidemics. It would have been interesting to see other types of kernels being used, such as a periodic kernel <math>k(x, x') = \sigma^2 \exp\left({-\frac{2 \sin^2 (\pi|x-x'|)/p}{l^2}}\right) </math>, as intuitively epidemics are known to have waves within seasons. This may have resulted in better-calibrated uncertainty estimates as well.<br />
<br />
It is mentioned that the the framework might not be good for non-seasonal epidemics because it requires training data, given that the COVID-19 pandemic comes in multiple waves and we have enough data from the first wave and second wave, we might be able to use this framework to predict the third wave and possibly the fourth one as well. It'd be interesting to see this forecasting framework being trained using the data from the first and second wave of COVID-19.<br />
<br />
Gaussian Process Regression (GPR) can be extended to COVID-19 outbreak forecast [3]. A Multi-Task Gaussian Process Regression (MTGP) model is a special case of GPR that yields multiple outputs. It was first proposed back in 2008, and its recent application was in the battery capacity prediction by Richardson et. al. The proposed MTGP model was applied on Worldwide, China, India, Italy, and USA data that covers COVID-19 statistics from Dec 31 2019 to June 25 2020. Results show that MTGP outperforms linear regression, support vector regression, random forest regression, and long short-term memory model when it comes to forecasting accuracy.<br />
<br />
A critique for GPR is that it does not take into account the physicality of the epidemic process. So it would be interesting to investigate if we can incorporate elements of physical processes in this such as using Ordinary Differential Equations (ODE) based modeling and seeing if that improves prediction accuracy or helps us explain phenomena.<br />
<br />
It will be interesting to provide a comparison of methods described in this paper with optimal control theory for modelling infectious disease.<br />
<br />
This is a very nice summary and everything was discussed clearly. I think it will be interesting that apply GPR by using the data in COVID-19, maybe this will be helpful to give a sense that how likely that 3rd wave of COVID-19 will happen. Maybe it will be better to discuss more that a training data set from a location will not train a good model for other countries. (p.s. I think <math> \sigma </math> is the standard deviation, sometimes saying variance function with the notation <math> \sigma </math> confused me.).<br />
<br />
== References ==<br />
<br />
[1] Estimated Influenza Illnesses, Medical visits, Hospitalizations, and Deaths in the United States - 2019–2020 Influenza Season. (2020). Retrieved November 16, 2020, from https://www.cdc.gov/flu/about/burden/2019-2020.html<br />
<br />
[2] Ray, E. L., Sakrejda, K., Lauer, S. A., Johansson, M. A.,and Reich, N. G. (2017).Infectious disease prediction with kernel conditional densityestimation.Statistics in Medicine, 36(30):4908–4929.<br />
<br />
[3] Ketu, S., Mishra, P.K. Enhanced Gaussian process regression-based forecasting model for COVID-19 outbreak and significance of IoT for its detection. Appl Intell (2020). https://doi.org/10.1007/s10489-020-01889-9<br />
<br />
[4] Schulz, E., Speekenbrink, M., and Krause, A. (2017).A tutorial on gaussian process regression with a focus onexploration-exploitation scenarios.bioRxiv.<br />
<br />
[5] Zimmer, C., Leuba, S. I., Cohen, T., and Yaesoubi, R.(2019).Accurate quantification of uncertainty in epidemicparameter estimates and predictions using stochasticcompartmental models.Statistical Methods in Medical Research,28(12):3591–3608.PMID: 30428780.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=46600Task Understanding from Confusing Multi-task Data2020-11-26T07:31:04Z<p>A29mukhe: /* References */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The first function allocates identifies an input to its respective task and the latter finds the relationship between the input and its label. See figure 1(b). To train a network of CSL, CSL-Net is constructed for representing CSL’s variables. However, this structure cannot be optimized by gradient back-propagation. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization. <br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine sigh complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. In multi-task learning, the task to which every sample belongs is known. With this task definition, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. Without this annotation, our goal is to understand the task concept from confusing input-label pairs. Overall, It requires manual task annotation to learn and this paper is interested in machine learning without a clear task definition and manual task annotation.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one distribution. Thus, latent variable learning is insufficient to solve the research problem, which is multi-task confusing samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, let <math> (x,y)</math> be the training samples from <math>y=f(x)</math>, which is an identical but unknown mapping relationship. Assuming the risk measure is mean squared error (MSE), the expected risk functional is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the data distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk functional can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk functional. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to the output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to recover both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of the MNIST digit data that has been colored. Each observation in this modified set consists of a colored image (<math>x_i</math>) and either the color, or the digit it represents (<math>y_i</math>). The goal is to recover the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': This data set has more observations than the "colored-MNIST" data and consists of pictures labeled with either the “Gender”, “Category”, and “Color” of the clothing item.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': In order to "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well learning the ground-truth. That means the initialization of the neural network is set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labeling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.<br />
<br />
Applications of multi-label classification include building a recommendation system, social media targeting, as well as detecting adverse drug reaction from text.<br />
<br />
Multi-label can be used to improve the syndrome diagnosis of a patient by focusing on multiple syndromes instead of a single syndrome.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem with manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, there are some limitations that can be improved for future work:<br />
The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process; The current algorithm is difficult to learn basic features directly through the CNN structure, and it is necessary to learn and train a fully connected network based on CNN features in the experiment.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed for converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented in more examples, to test the robustness of different types of data.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
When using this framework for classification, the order of the one-hot classification labels for each task will likely influence the relationships learned between each task, since the same output header is used for all tasks. This may be why this method fails to learn low level representations, and requires pretraining. I would like to see more explanation in the paper about why this isn't a problem, if it was investigated.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.<br />
<br />
[6] Guo-Ping Liu, Jian-Jun Yan, Yi-Qin Wang, Jing-Jing Fu, Zhao-Xia Xu, Rui Guo, Peng Qian, "Application of Multilabel Learning Using the Relevant Feature for Each Label in Chronic Gastritis Syndrome Diagnosis", Evidence-Based Complementary and Alternative Medicine, vol. 2012, Article ID 135387, 9 pages, 2012. https://doi.org/10.1155/2012/135387</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=46599Task Understanding from Confusing Multi-task Data2020-11-26T07:30:16Z<p>A29mukhe: /* Application of Multi-label Learning */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The first function allocates identifies an input to its respective task and the latter finds the relationship between the input and its label. See figure 1(b). To train a network of CSL, CSL-Net is constructed for representing CSL’s variables. However, this structure cannot be optimized by gradient back-propagation. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization. <br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine sigh complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. In multi-task learning, the task to which every sample belongs is known. With this task definition, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. Without this annotation, our goal is to understand the task concept from confusing input-label pairs. Overall, It requires manual task annotation to learn and this paper is interested in machine learning without a clear task definition and manual task annotation.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one distribution. Thus, latent variable learning is insufficient to solve the research problem, which is multi-task confusing samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, let <math> (x,y)</math> be the training samples from <math>y=f(x)</math>, which is an identical but unknown mapping relationship. Assuming the risk measure is mean squared error (MSE), the expected risk functional is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the data distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk functional can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk functional. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to the output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to recover both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of the MNIST digit data that has been colored. Each observation in this modified set consists of a colored image (<math>x_i</math>) and either the color, or the digit it represents (<math>y_i</math>). The goal is to recover the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': This data set has more observations than the "colored-MNIST" data and consists of pictures labeled with either the “Gender”, “Category”, and “Color” of the clothing item.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': In order to "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well learning the ground-truth. That means the initialization of the neural network is set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labeling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.<br />
<br />
Applications of multi-label classification include building a recommendation system, social media targeting, as well as detecting adverse drug reaction from text.<br />
<br />
Multi-label can be used to improve the syndrome diagnosis of a patient by focusing on multiple syndromes instead of a single syndrome.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem with manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, there are some limitations that can be improved for future work:<br />
The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process; The current algorithm is difficult to learn basic features directly through the CNN structure, and it is necessary to learn and train a fully connected network based on CNN features in the experiment.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed for converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented in more examples, to test the robustness of different types of data.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
When using this framework for classification, the order of the one-hot classification labels for each task will likely influence the relationships learned between each task, since the same output header is used for all tasks. This may be why this method fails to learn low level representations, and requires pretraining. I would like to see more explanation in the paper about why this isn't a problem, if it was investigated.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan&diff=46596User:Yktan2020-11-26T06:31:59Z<p>A29mukhe: /* Critiques/ Insights */</p>
<hr />
<div>== Presented by == <br />
Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew<br />
<br />
== Introduction ==<br />
<br />
Much of the success in training deep neural networks (DNNs) is thanks to the collection of large datasets with human annotated labels. However, human annotation is both a time-consuming and expensive task, especially for the data that requires expertise such as medical data. Furthermore, certain datasets will be noisy due to the biases introduced by different annotators.<br />
<br />
There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. An example of a popular loss correction approach is the bootstrapping loss approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples.<br />
<br />
This paper introduces DivideMix, which combines approaches from LNL and SSL. One unique thing about DivideMix is that it discards sample labels that are highly likely to be noisy and leverages these noisy samples as unlabeled data instead. This prevents the model from overfitting and improves generalization performance. Key contributions of this work are:<br />
1) Co-divide, which trains two networks simultaneously, aims to improve generalization and avoiding confirmation bias.<br />
2) During SSL phase, an improvement is made on an existing method (MixMatch) by combining it with another method (MixUp).<br />
3) Significant improvements to state-of-the-art results on multiple conditions are experimentally shown while using DivideMix. Extensive ablation study and qualitative results are also shown to examine the effect of different components.<br />
<br />
== Motivation ==<br />
<br />
While much has been achieved in training DNNs with noisy labels and SSL methods individually, not much progress has been made in exploring their underlying connections and building on top of the two approaches simultaneously. <br />
<br />
Existing LNL methods aim to correct the loss function by:<br />
<ol><br />
<li> Treating all samples equally and correcting loss explicitly or implicitly through relabeling of the noisy samples<br />
<li> Reweighting training samples or separating clean and noisy samples, which results in correction of the loss function<br />
</ol><br />
<br />
A few examples of LNL methods include:<br />
<ol><br />
<li> Estimating the noise transition matrix to correct the loss function<br />
<li> Leveraging DNNs to correct labels and using them to modify the loss<br />
<li> Reweighting samples so that noisy labels contribute less to the loss<br />
</ol><br />
<br />
However, these methods each have some downsides. For example, it is very challenging to correctly estimate the noise transition matrix in the first method; for the second method, DNNs tend to overfit to datasets with high noise ratio; for the third method, we need to be able to identify clean samples, which has also proven to be challenging.<br />
<br />
On the other hand, SSL methods mostly leverage unlabeled data using regularization to improve model performance. A recently proposed method, MixMatch incorporates the two classes of regularization – consistency regularization which enforces the model to produce consistent predictions on augmented input data, and entropy minimization which encourages the model to give high-confidence predictions on unlabeled data, as well as MixUp regularization. <br />
<br />
DivideMix partially adopts LNL in that it removes the labels that are highly likely to be noisy by using co-divide to avoid the confirmation bias problem. It then utilizes the noisy samples as unlabeled data and adopts an improved version of MixMatch (SSL) which accounts for the label noise during the label co-refinement and co-guessing phase. By incorporating SSL techniques into LNL and taking the best of both worlds, DivideMix aims to produce highly promising results in training DNNs by better addressing the confirmation bias problem, more accurately distinguishing and utilizing noisy samples, and performing well under high levels of noise.<br />
<br />
== Model Architecture ==<br />
<br />
DivideMix leverages semi-supervised learning to achieve effective modeling. The sample is first split into a labeled set and an unlabeled set. This is achieved by fitting a Gaussian Mixture Model as a per-sample loss distribution. The unlabeled set is made up of data points with discarded labels deemed noisy. Then, to avoid confirmation bias, which is typical when a model is self-training, two models are being trained simultaneously to filter error for each other. This is done by dividing the data using one model and then training the other model. This algorithm, known as Co-divide, keeps the two networks from converging when training, which avoids the bias from occurring. Figure 1 describes the algorithm in graphical form.<br />
<br />
[[File:ModelArchitecture.PNG | center]]<br />
<br />
<div align="center">Figure 1: Model Architecture of DivideMix</div><br />
<br />
For each epoch, the network divides the dataset into a labeled set consisting of clean data, and an unlabeled set consisting of noisy data, which is then used as training data for the other network, where training is done in mini-batches. For each batch of the labelled samples, co-refinement is performed by using the ground truth label <math> y_b </math>, the predicted label <math> p_b </math>, and the posterior is used as the weight, <math> w_b </math>. <br />
<br />
<center><math> \bar{y}_b = w_b y_b + (1-w_b) p_b </math></center> <br />
<br />
Then, a sharpening function is implemented on this weighted sum to produce the estimate, <math> \hat{y}_b </math>. Using all these predicted labels, the unlabeled samples will then be assigned a "co-guessed" label, which should produce a more accurate prediction. Having calculated all these labels, MixMatch is applied to the combined mini-batch of labeled, <math> \hat{X} </math> and unlabeled data, <math> \hat{U} </math>, where, for a pair of samples and their labels, one new sample and new label is produced. More specifically, for a pair of samples <math> (x_1,x_2) </math> and their labels <math> (p_1,p_2) </math>, the mixed sample <math> (x',p') </math> is:<br />
<br />
<center><br />
<math><br />
\begin{alignat}{2}<br />
<br />
\lambda &\sim Beta(\alpha, \alpha) \\<br />
\lambda ' &= max(\lambda, 1 - \lambda) \\<br />
x' &= \lambda ' x_1 + (1 - \lambda ' ) x_2 \\<br />
p' &= \lambda ' p_1 + (1 - \lambda ' ) p_2 \\<br />
<br />
\end{alignat}<br />
</math><br />
</center> <br />
<br />
MixMatch transforms <math> \hat{X} </math> and <math> \hat{U} </math> into <math> X' </math> and <math> U' </math>. Then, the loss on <math> X' </math>, <math> L_X </math> (Cross-entropy loss) and the loss on <math> U' </math>, <math> L_U </math> (Mean Squared Error) are calculated. A regularization term, <math> L_{reg} </math>, is introduced to regularize the model's average output across all samples in the mini-batch. Then, the total loss is calculated as:<br />
<br />
<center><math> L = L_X + \lambda_u L_U + \lambda_r L_{reg} </math></center> ,<br />
<br />
where <math> \lambda_r </math> is set to 1, and <math> \lambda_u </math> is used to control the unsupervised loss.<br />
<br />
Lastly, the stochastic gradient descent formula is updated with the calculated loss, <math> L </math>, and the estimated parameters, <math> \boldsymbol{ \theta } </math>.<br />
<br />
== Results ==<br />
'''Applications'''<br />
<br />
The method was validated using four benchmark datasets: CIFAR-10, CIFAR100 (Krizhevsky & Hinton, 2009)(both contain 50K training images and 10K test images of size 32 × 32), Clothing1M (Xiao et al., 2015), and WebVision (Li et al., 2017a).<br />
Two types of label noise are used in the experiments: symmetric and asymmetric.<br />
An 18-layer PreAct Resnet (He et al., 2016) is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 300 epochs. The initial learning rate was set to 0.02, and reduced by a factor of 10 after 150 epochs. Before applying the Co-divide and MixMatch strategies, the models were first independently trained over the entire dataset using cross-entropy loss during a "warm-up" period. Initially, training the models in this way prepares a more regular distribution of losses to improve upon in subsequent epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100. For all CIFAR experiments, we use the same hyperparameters M = 2, T = 0.5, and α = 4. τ is set as 0.5 except for 90% noise ratio when it is set as 0.6.<br />
<br />
<br />
'''Comparison of State-of-the-Art Methods'''<br />
<br />
The effectiveness of DivideMix was shown by comparing the test accuracy with the most recent state-of-the-art methods: <br />
Meta-Learning (Li et al., 2019) proposes a gradient-based method to find model parameters that are more noise-tolerant; <br />
Joint-Optim (Tanaka et al., 2018) and P-correction (Yi & Wu, 2019) jointly optimize the sample labels and the network parameters;<br />
M-correction (Arazo et al., 2019) models sample loss with BMM and apply MixUp.<br />
The following are the results on CIFAR-10 and CIFAR-100 with different levels of symmetric label noise ranging from 20% to 90%. Both the best test accuracy across all epochs and the averaged test accuracy over the last 10 epochs were recorded in the following table:<br />
<br />
<br />
[[File:divideMixtable1.PNG | center]]<br />
<br />
From table1, the author noticed that none of these methods can consistently outperform others across different datasets. M-correction excels at symmetric noise, whereas Meta-Learning performs better for asymmetric noise. DivideMix outperforms state-of-the-art methods by a large margin across all noise ratios. The improvement is substantial (∼10% of accuracy) for the more challenging CIFAR-100 with high noise ratios.<br />
<br />
DivideMix was compared with the state-of-the-art methods with the other two datasets: Clothing1M and WebVision. It also shows that DivideMix consistently outperforms state-of-the-art methods across all datasets with different types of label noise. For WebVision, DivideMix achieves more than 12% improvement in top-1 accuracy. <br />
<br />
<br />
'''Ablation Study'''<br />
<br />
The effect of removing different components to provide insights into what makes DivideMix successful. We analyze the results in Table 5 as follows.<br />
<br />
<br />
[[File:DivideMixtable5.PNG | center]]<br />
<br />
The authors find that both label refinement and input augmentation are beneficial for DivideMix.<br />
<br />
== Conclusion ==<br />
<br />
This paper provides a new and effective algorithm for learning with noisy labels by leveraging SSL. The DivideMix method trains two networks simultaneously and utilizes co-guessing and co-labeling effectively, therefore it is a robust approach to dealing with noise in datasets. DivideMix has also been tested using various datasets with the results consistently being one of the best when compared to other advanced methods.<br />
<br />
Future work of DivideMix is to create an adaptation for other applications such as Natural Language Processing, and incorporating the ideas of SSL and LNL into DivideMix architecture.<br />
<br />
== Critiques/ Insights ==<br />
<br />
1. While combining both models makes the result better, the author did not show the relative time increase using this new combined methodology, which is very crucial considering training a large amount of data, especially for images. In addition, it seems that the author did not perform much on hyperparameters tuning for the combined model.<br />
<br />
2. There is an interesting insight, which is when noise ratio increases from 80% to 90%, the accuracy of DivideMix drops dramatically in both datasets.<br />
<br />
3. There should be further explanation on why the learning rate drops by a factor of 10 after 150 epochs.<br />
<br />
== References ==<br />
Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Unsupervised<br />
label noise modeling and loss correction. In ICML, pp. 312–321, 2019.<br />
<br />
David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin<br />
Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019.<br />
<br />
Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach<br />
to learning from noisy labels. In WACV, pp. 1215–1224, 2018.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin&diff=46594User:Gtompkin2020-11-26T06:17:43Z<p>A29mukhe: /* Theoretical Analysis */</p>
<hr />
<div>== Presented by == <br />
Grace Tompkins, Tatiana Krikella, Swaleh Hussain<br />
<br />
== Introduction ==<br />
<br />
One of the fundamental challenges in machine learning and data science is dealing with missing and incomplete data. This paper proposes a theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of data by imputation or other commonly used methods in existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. The neuron's response at the first hidden layer is generalized by taking its expected value to process this probabilistic representation. This process is essentially calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous as it has the ability to train neural networks using incomplete observations from datasets, which are ubiquitous in practice. This approach also requires minimal adjustments and modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results showed the practical uses of this methodology on several different types of networks.<br />
<br />
== Related Work ==<br />
<br />
Currently, dealing with incomplete inputs in machine learning requires filling absent attributes based on complete, observed data. Two commonly used methods are mean imputation and <math>k</math>-nearest neighbours (k-NN) imputation. In the former (mean imputation), the missing value is replaced by the mean of all available values of that feature in the dataset. In the latter (k-NN imputation), the missing value is also replaced by the mean, however it is now computed using only the k "closest" samples in the dataset. If the dataset is numerical, Euclidean distance could be used as a measure of "closeness", for example. Other methods for dealing with missing data involve training separate neural networks and extreme learning machines. Probabilistic models of incomplete data can also be built depending on the mechanism missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), which can be fed into a particular learning model. Further, the decision function can also be trained using available/visible inputs alone. Previous work using neural networks for missing data includes a paper by Bengio and Gringras [1] where the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et. al. [2] also used neural networks by introducing a multi-prediction deep Boltzmann machine which could perform classification on data with missingness in the inputs.<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
In this approach, the adaptation of a given neural network to incomplete data relies on two steps: the estimation of the missing data and the generalization of the neuron's activation. <br />
<br />
Let <math>(x,J)</math> represent a missing data point, where <math>x \in \mathbb{R}^D </math>, and <math>J \subset \{1,...,D\} </math> is a set of attributes with missing data. <math>(x,J)</math> therefore represents an "incomplete" data point for which <math>|J|</math>-many entries are unknown - examples of this could be a list of daily temperature readings over a week where temperature was not recorded on the third day (<math>x\in \mathbb{R}^7, J= \{3\}</math>), an audio transcript that goes silent for certain timespans, or images that are partially masked out (as discussed in the examples).<br />
<br />
For each missing point <math>(x,J)</math>, define an affine subspace consisting of all points which coincide with <math>x</math> on known coordinates <math>J'=\{1,…,N\}/J</math>: <br />
<br />
<center><math>S=Aff[x,J]=span(e_J) </math></center> <br />
where <math>e_J=[e_j]_{j\in J}</math> and <math>e_j</math> is the <math> j^{th}</math> canonical vector in <math>\mathbb{R}^D </math>.<br />
<br />
Assume that the missing data points come from the D-dimensional probability distribution, <math>F</math>. In their approach, the authors assume that the data points follow a mixture of Gaussians (GMM) with diagonal covariance matrices. By choosing diagonal covariance matrices, the number of model parameters is reduced. To model the missing points <math>(x,J)</math>, the density <math>F</math> is restricted to the affine subspace <math>S</math>. Thus, possible values of <math>(x,J)</math> are modelled using the conditional density <math>F_S: S \to \mathbb{R} </math>, <br />
<br />
<center><math>F_S(x) = \begin{cases}<br />
\frac{1}{\int_{S} F(s) \,ds}F(x) & \text{if $x \in S$,} \\<br />
0 & \text{otherwise.}<br />
\end{cases} </math></center><br />
<br />
To process the missing data by a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all the neurons in the first hidden layer of the network to process the probability density functions representing the missing data points. For the conditional density function <math>F_S</math>, the authors define the generalized activation of a neuron <math>n: \mathbb{R}^D \to \mathbb{R}</math> on <math>F_S </math> as: <br />
<br />
<center><math>n(F_S)=E[n(x)|x \sim F_S]=\int n(x)F_S(x) \,dx</math>,</center> <br />
provided that the expectation exists. <br />
<br />
The following two theorems describe how to apply the above generalizations to both the ReLU and the RBF neurons, respectively. <br />
<br />
'''Theorem 3.1''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians. Given weights <math>w=(w_1, ..., w_D) \in \mathbb{R}^D,</math><math> b \in \mathbb{R} </math>, we have<br />
<br />
<center><math>\text{ReLU}_{w,b}(F)=\sum_i{p_iNR\big(\frac{w^{\top}m_i+b}{\sqrt{w^{\top}\Sigma_iw}}}\big)</math></center> <br />
<br />
where <math>NR(x)=\text{ReLU}[N(x,1)]</math> and <math>\text{ReLU}_{w,b}(x)=\text{max}(w^{\top}+b, 0)</math>, <math>w \in \mathbb{R}^D </math> and <math> b \in \mathbb{R}</math> is the bias.<br />
<br />
'''Theorem 3.2''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians and let the RBF unit be parametrized by <math>N(c, \Gamma) </math>. We have: <br />
<br />
<center><math>\text{RBF}_{c, \Gamma}(F) = \sum_{i=1}^k{p_iN(m_i-c, \Gamma+\Sigma_i)}(0)</math>.</center> <br />
<br />
In the case where the data set contains no missing values, the generalized neurons reduce to classical ones, since the distribution <math>F</math> is only used to estimate possible values at missing attributes. However, if one wishes to use an incomplete data set in the testing stage, then an incomplete data set must be used to train the model.<br />
<br />
<math> </math><br />
<br />
== Theoretical Analysis ==<br />
<br />
The main theoretical results, which are summarized below, show that using generalized neuron's activation at the first layer does not lead to the loss of information. <br />
<br />
Let the generalized response of a neuron <math>n: \mathbb{R}^D \rightarrow \mathbb{R}</math> evaluated on a probability measure <math>\mu</math> which is given by <br />
<center><math>n(\mu) := \int n(x)d\mu(x)</math></center><br />
<br />
'''Theorem 4.1.''' Let <math>\mu</math>, <math>v</math> be probabilistic measures satisfying <math>\int ||x|| d \mu(x) < \infty</math>. If <br />
<center><math>ReLU_{w,b}(\mu) = ReLU_{w,b}(\nu) \text{ for } w \in \mathbb{R}^D, b \in \mathbb{R}</math></center> then <math>\nu = \mu</math>.<br />
<br />
Theorem 4.1 shows that a neural network with generalized ReLU units is able to identify any two probability measures. The proof presented by the authors uses the Universal Approximation Property (UAP), and is summarized as follows. <br />
<br />
''Sketch of Proof'' Let <math>w \in \mathbb{R}^D</math> be fixed and define the set <center><math>F_w = \{p: \mathbb{R} \rightarrow \mathbb{R}: \int p(w^Tx)d\mu(x) = \int p(w^Tx)d\nu(x)\}.</math></center> The first step of the proof involves showing that <math>F_w</math> contains all continuous and bounded functions. The authors show this by showing that a piecewise continuous function that is affine linear on specific intervals, <math>Q</math>, is in the set <math>F_w</math>. This involves re-writing <math>Q</math> as a sum of tent-like piecewise linear functions, <math>T</math> and showing that <math>T \in F_w</math> (since it is sufficient to only show <math>T \in F_w</math>). <br />
<br />
Next, the authors show that an arbitrary bounded continuous function <math>G</math> is in <math>F_w</math> by the Lebesgue dominated convergence theorem. <br />
<br />
Then, as <math>cos(\cdot), sin(\cdot) \in F_w</math>, the function <center><math>exp(ir) = cos(r) + i sin(r) \in F_w</math></center> and we have the equality <center><math>\int exp(iw^Tx)d\mu(x) = \int exp(iw^Tx)d\nu(x).</math></center> Since <math>w</math> was arbitrarily chosen, we can conclude that <math>\mu = \nu</math> as the characteristic functions of the two measures coincide. <br />
<br />
A result analogous to Theorem 4.1 for RBF can also be obtained.<br />
<br />
'''Theorem 2.1''' Let <math>\mu, \nu</math> be probabilistic measures. If<br />
$$ RBF_{m,\alpha}(\mu) = RBF_{m,\alpha}(\nu) \text{ for every } m \in \mathbb{R}^D, \alpha > 0,$$<br />
then <math>\nu = \mu</math>.<br />
<br />
More general results can be obtained making stronger assumptions on the probability measures. For example, if a given family of neurons satisfies UAP, then their generalization can identify any probability measure with compact support.<br />
<br />
'''Theorem 2.2''' Let <math>\mu, \nu</math> be probabilistic measures with compact support. Let <math>\mathcal{N}</math> be a family of functions having UAP. If<br />
$$n(\mu) = n(\nu) \text{ for every } n \in \mathcal{N},$$<br />
then <math>\nu = \mu</math>.<br />
<br />
A detailed proof Theorems 2.1 and 2.2 can be found in section 2 of the Supplementary Materials, which can be downloaded [https://proceedings.neurips.cc/paper/2018/hash/411ae1bf081d1674ca6091f8c59a266f-Abstract.html here].<br />
<br />
== Experimental Results ==<br />
The model was applied to three types of algorithms: an Autoencoder (AE), a multilayer perceptron, and a radial basis function network.<br />
<br />
'''Autoencoder'''<br />
<br />
Corrupted images were restored as a part of this experiment. Grayscale handwritten digits were obtained from the MNIST database. A 13 by 13 (169 pixels) square was removed from each 28 by 28 (784 pixels) image. The location of the square was uniformly sampled for each image. The autoencoder used included 5 hidden layers. The first layer used ReLU activation functions while the subsequent layers utilized sigmoids. The loss function was computed using pixels from outside the mask. <br />
<br />
Popular imputation techniques were compared against the conducted experiment:<br />
<br />
''k-nn:'' Replaced missing features with the mean of respective features calculated using K nearest training samples. Here, K=5. <br />
<br />
''mean:'' Replaced missing features with the mean of respective features calculated using all incomplete training samples.<br />
<br />
''dropout:'' Dropped input neutrons with missing values. <br />
<br />
Moreover, a context encoder (CE) was trained by replacing missing features with their means. Unlike mean imputation, the complete data was used in the training phase. The method under study performed better than the imputation methods inside and outside the mask. Additionally, the method under study outperformed CE based on the whole area and area outside the mask. <br />
<br />
The mean square error of reconstruction is used to test each method. The errors calculated over the whole area, inside and outside the mask are shown in Table 1, which indicates the method introduced in this paper is the most competitive.<br />
<br />
[[File:Group3_Table1.png |center]]<br />
<div align="center">Table 1: Mean square error of reconstruction on MNIST incomplete images scaled to [0, 1]</div><br />
<br />
'''Multilayer Perceptron'''<br />
<br />
A multilayer perceptron with 3 ReLU hidden layers was applied to a multi-class classification problem on the Epileptic Seizure Recognition (ESR) data set taken from [3]. Each 178-dimensional vector (out of 11500 samples) is the EEG recording of a given person for 1 second, categorized into one of 5 classes. To generate missing attributes, 25%, 50%, 75%, and 90% of observations were randomly removed. The aforementioned imputation methods were used in addition to Multiple Imputation by Chained Equation (mice) and a mixture of Gaussians (gmm). The former utilizes the conditional distribution of data by Markov chain Monte Carlo techniques to draw imputations. The latter replaces missing features with values sampled from GMM estimated from incomplete data using the EM algorithm. <br />
<br />
Double 5-fold cross-validation was used to report classification results. The model under study outperformed classical imputation methods, which give reasonable results only for a low number of missing values. The method under study performs nearly as well as CE, even though CE had access to complete training data. <br />
<br />
'''Radial Basis Function Network'''<br />
<br />
RBFN can be considered as a minimal architecture implementing our model, which contains only one hidden layer. A cross-entropy function was applied to a softmax in the output layer. Two-class data sets retrieved from the UCI repository [4] with internally missing attributes were used. Since the classification is binary, two additional SVM kernel models which work directly with incomplete data without performing any imputations were included in the experiment:<br />
<br />
''geom:'' The objective function is based on the geometric interpretation of the margin and aims to maximize the margin of each sample in its own subspace [5].<br />
<br />
''karma:'' This algorithm iteratively tunes kernel classifier under low-rank assumptions [6].<br />
<br />
The above SVM methods were combined with RBF kernel function. The number of RBF units was selected in the inner cross-validation from the range {25, 50, 75, 100}. Initial centers of RBFNs were randomly selected from training data while variances were samples from <math>N(0,1)</math> distribution. For SVM methods, the margin parameter <math>C</math> and kernel radius <math>\gamma</math> were selected from <math>\{2^k :k=−5,−3,...,9\}</math> for both parameters. For karma, additional parameter <math>\gamma_{karma}</math> was selected from the set <math>\{1, 2\}</math>.<br />
<br />
The model under study outperformed imputation techniques in almost all cases. It partially confirms that the use of raw incomplete data in neural networks is a usually better approach than filling missing attributes before the learning process. Moreover, it obtained more accurate results than modified kernel methods, which directly work on incomplete data. The performance of the model was once again comparable to, and in some cases better than CE, which had access to the complete data.<br />
<br />
== Conclusion ==<br />
<br />
The results with these experiments along with the theoretical results conclude that this novel approach for dealing with missing data through a modification of a neural network is beneficial and outperforms many existing methods. This approach, which utilizes representing missing data with a probability density function, allows a neural network to determine a more generalized and accurate response of the neuron.<br />
<br />
== Critiques ==<br />
<br />
- A simulation study where the mechanism of missingness is known will be interesting to examine. Doing this will allow us to see when the proposed method is better than existing methods, and under what conditions.<br />
<br />
- This method of imputing incomplete data has many limitations: in most cases when we have a missing data point we are facing a relatively small amount of data that does not require training of a neural network. For a large dataset, missing records does not seem to be very crucial because obtaining data will be relatively easier, and using an empirical way of imputing data such as a majority vote will be sufficient.<br />
<br />
- An interesting application of this problem is in NLP. In NLP, especially Question Answering, there is a problem where a query is given and an answer must be retrieved, but the knowledge base is incomplete. There is therefore a requirement for the model to be able to infer information from the existing knowledge base in order to answer the question. Although this problem is a little more contrived than the one mentioned here, it is nevertheless similar in nature because it requires the ability to probabilistically determine some value which can then be used as a response.<br />
<br />
- The experiments in this paper evaluate this method against low amounts of missing data. It would be interesting to see the properties of this imputation when a majority of the data is missing, and see if this method can outperform dropout training in this setting (dropout is known to be surprisingly robust even at high drop levels).<br />
<br />
- This problem can possibly be applied to face recognition where given a blurry image of a person's face, the neural network can make the image clearer such that the face of the person would be visible for humans to see and also possible for the software to identify who the person is.<br />
<br />
- This novel approach can also be applied to restoring damaged handwritten historical documents. By feeding in a damaged document with portions of unreadable texts, the neural network can add missing information utilizing a trained context encoder<br />
<br />
- It will be interesting to see how this method performs with audio data, i.e. say if there are parts of an audio file that are missing, whether the neural network will be able to learn the underlying distribution and impute the missing sections of speech.<br />
<br />
- In general, data are usually missing in a specific part of the content. For example, old books usually have first couple page or last couple pages that are missing. It would be interesting to see that how the distribution of "missing data" will be applied in those cases.<br />
<br />
- In this paper, the researchers were able to outperform existing imputation methods using neural networks. It would be really nice to see how does the usage of neural networks impact the need for amount of data, and how much more training is required in comparison to the other algorithms provided in this paper. <br />
<br />
- It might be an interesting approach to investigate how the size of missing data may influence the training. For example, in the MNIST AutoEncoder, some algorithms involve masks to generate more general images and avoid overfitting. The approach could compare the result by changing the size of the "missing" part to illustrate to what degree can we ignore the missing data and view them as assistance.<br />
<br />
- It would be nice to see how the method under study outperforms both the Multilayer Perceptron and Radial Basis Function Network methods mentioned in the summary. Since these two methods are placed under the Experimental Results section, it is expected that they consist more justifications and supporting evidence (analytical results, tables, graphs, etc.) then what have been presented, so that it is more convincing for the readers.<br />
<br />
== References ==<br />
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous<br />
data. In Advances in neural information processing systems, pages 395–401, 1996.<br />
<br />
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.<br />
<br />
[3] Ralph G Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian E Elger. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E, 64(6):061907, 2001.<br />
<br />
[4] Arthur Asuncion and David J. Newman. UCI Machine Learning Repository, 2007.<br />
<br />
[5] Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, and Daphne Koller. Max-margin classification of data with absent features. Journal of Machine Learning Research, 9:1–21, 2008.<br />
<br />
[6] Elad Hazan, Roi Livni, and Yishay Mansour. Classification with low rank and missing data. In Proceedings of The 32nd International Conference on Machine Learning, pages 257–266, 2015.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Describtion_of_Text_Mining&diff=46587Describtion of Text Mining2020-11-26T05:39:53Z<p>A29mukhe: /* Clustering */</p>
<hr />
<div>== Presented by == <br />
Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid<br />
<br />
== Introduction ==<br />
This paper focuses on the different text mining tasks and the existence of text mining in healthcare and biomedical domains. The text mining field has been popular as a result of the amount of text data that is available in different forms. The text data is bound to grow even more in 2020, indicating a 50 times growth since 2010. Text is a kind of unstructured information, which is easy for humans to construct and understand, but it is difficult for machines. Hence, there is a need to design algorithms to effectively process this avalanche of text. To further explore the text mining field, the related text mining approaches can be considered. The different text mining approaches relate to two main methods: knowledge delivery and traditional data mining methods. <br />
<br />
The authors note that knowledge delivery methods involve the application of different steps to a specific data set to create specific patterns. Research in knowledge delivery methods has evolved over the years due to advances in hardware and software technology. On the other hand, data mining has experienced substantial development through the intersection of three fields: databases, machine learning, and statistics. As brought out by the authors, text mining approaches focus on the exploration of information from a specific text. The information explored is in the form of structured, semi-structured, and unstructured text. It is important to note that text mining covers different sets of algorithms and topics that include information retrieval. The topics and algorithms are used for analyzing different text forms.<br />
<br />
== Text Representation and Encoding ==<br />
In this section of the paper, the authors explore the different ways in which the text can be represented on a large collection of documents. One common way of representing the documents is in the form of a bag of words. The bag of words considers the occurrences of different terms. In different text mining applications, documents are ranked and represented as vectors so as to display the significance of any word. The authors note that the three basic models used are vector space, inference network, and the probabilistic models. The vector space model is used to represent documents by converting them into vectors. In the model, a variable is used to represent each model to indicate the importance of the word in the document. The words are weighted using the TF-IDF scheme computed as <br />
<br />
$$<br />
q(w)=f_d(w)*log{\frac{|D|}{f_D(w)}}<br />
$$<br />
<br />
In many text mining algorithms, one of the key components is preprocessing. Preprocessing consists of different tasks that include filtering, tokenization, stemming, and lemmatization. The first step is tokenization, where a character sequence is broken down into different words or phrases. After the breakdown, filtering is carried out to remove some words. The various word inflected forms are grouped together through lemmatization, and later, the derived roots of the derived words are obtained through stemming.<br />
<br />
== Classification ==<br />
Classification in Text Mining aims to assigned predefined classes to text documents. For a set <math>\mathcal{D} = {d_1, d_2, ... d_n}</math> of documents, such that each <math>d_i</math> is mapped to a label <math>l_i</math> from the set <math>\mathcal{L} = {l_1, l_2, ... l_k}</math>. The goal is to find a classification model <math>f</math> such that: <math>\\</math><br />
$$<br />
f: \mathcal{D} \rightarrow \mathcal{L} \quad \quad \quad f(\mathcal{d}) = \mathcal{l}<br />
$$<br />
The author illustrates 4 different classifiers that are commonly used in text mining.<br />
<br />
<br />
'''1. Naive Bayes Classifier''' <br />
<br />
Bayes rule is used to classify new examples and select the class that is most has the generated result. <br />
Naive Bayes Classifier models the distribution of documents in each class using a probabilistic model assuming that the distribution<br />
of different terms is independent of each other. The models commonly used in this classifier tried to find the posterior probability of a class based on the distribution and assumes that the documents generated are based on a mixture model parameterized by <math>\theta</math> and compute the likelihood of a document using the sum of probabilities over all mixture component.<br />
<br />
'''2. Nearest Neighbour Classifier'''<br />
<br />
Nearest Neighbour Classifier uses distance-based measures to perform the classification. The documents which belong to the same class are more likely "similar" or close to each other based on the similarity measure. The classification of the test documents is inferred from the class labels of similar documents in the training set.<br />
<br />
'''3. Decision Tree Classifier'''<br />
<br />
A hierarchical tree of the training instances, in which a condition on the attribute value is used to divide the data hierarchically. The decision tree recursively partitions the training data set into smaller subdivisions based on a set of tests defined at each node or branch. Each node of the tree is a test of some attribute of the training instance, and each branch descending from the node corresponds to one of the values of this attribute. The conditions on the nodes are commonly defined by the terms in the text documents.<br />
<br />
'''4. Support Vector Machines'''<br />
<br />
SVM is a form of Linear Classifiers which are models that makes a classification decision based on the value of the linear combinations of the documents features. The output of a linear predictor is defined to the <math> y=\vec{a} \cdot \vec{x} + b</math> where <math>\vec{x}</math> is the normalized document word frequency vector, <math>\vec{a}</math> is a vector of coefficient and <math>b</math> is a scalar. Support Vector Machines attempts to find a linear separators between various classes. An advantage of the SVM method is it is robust to high dimensionality.<br />
<br />
== Clustering ==<br />
Clustering has been extensively studied in the context of the text as it has a wide range of applications such as visualization and document organization.<br />
<br />
Clustering algorithms are used to group similar documents and thus aids in information retrieval.<br />
<br />
== Information Extraction ==<br />
Information Extraction (IE) is the process of extracting useful, structured information from unstructured or semi-structured text. It automatically extracts based on our command. <br />
<br />
For example, consider the following sentence, “XYZ company was founded by Peter in the year of 1950”<br />
We can identify the following information:<br />
<br />
Founderof(Peter, XYZ)<br />
Foundedin(1950, XYZ)<br />
<br />
The author mentioned 4 parts that are important for Information Extraction<br />
<br />
'''1. Namely Entity Recognition(NER)'''<br />
This is the process of identifying real-world entity from free text, such as "Apple Inc.", "Donald Trump", "PlayStation 5" etc. Moreover, the task is to identify the category of these entities, such as "Apple Inc." is in the category of the company, "Donald Trump" is in the category of the USA president, and "PlayStation 5" is in the category of the entertainment system. <br />
<br />
'''2. Hidden Markov Model'''<br />
Since traditional probabilistic classification does not consider the predicted labels of neighbor words, we use the Hidden Markov Model when doing Information Extraction. This model is different because it considers the label of one word depends on the previous words that appeared. <br />
<br />
'''3. Conditional Random Fields'''<br />
This is a technique that is widely used in Information Extraction. The definition of it is related to graph theory. <br />
let G = (V, E) be a graph and Yv stands for the index of the vertices in G. Then (X, Y) is a conditional random field, when the random variables Yv, conditioned on X, obey Markov property with respect to the graph, and:<br />
p(Yv |X, Yw ,w , v) = p(Yv |X, Yw ,w ∼ v), where w ∼ v means w and v are neighbors in G.<br />
<br />
'''4. Relation Extraction'''<br />
This is a task of finding semantic relationships between word entities in text documents. Such as "Seth Curry" is the brother of "Stephen Curry", if there is a document including these two names, the task is to identify the relationship of these two entities.<br />
<br />
== Biomedical Application ==<br />
<br />
Text mining has several applications in the domain of biomedical sciences. The explosion of academic literature in the field has made it quite hard for scientists to keep up with novel research. This is why text mining techniques are ever so important in making the knowledge digestible.<br />
<br />
The text mining techniques are able to extract meaningful information from large data by making use of biomedical ontology, which is a compilation of a common set of terms used in an area of knowledge. The Unified Medical Language System (UMLS) is the most comprehensive such resource, consisting of definitions of biomedical jargon. Several information extraction algorithms rely on the ontology to perform tasks such as Named Entity Recognition (NER) and Relation Extraction.<br />
<br />
NER involves locating and classifying biomedical entities into meaningful categories and assigning semantic representation to those entities. The NER methods can be broadly grouped into Dictionary-based, Rule-based and Statistical approaches. Relation extraction, on the other hand, is the process of determining relationships between the entities. This is accomplished mainly by identifying the correlation between entities through analyzing the frequency of terms, as well as rules defined by domain experts. Moreover, modern algorithms are also able to summarize large documents and answer natural language questions posed by humans.<br />
<br />
== Conclusion ==<br />
<br />
This paper gave a holistic overview of the methods and applications of text mining, particularly its relevance in the biomedical domain. It highlights several popular algorithms and summarizes them along with their advantages, limitations and some potential situations where they could be used. Because of ever-growing data, for example, the very high volume of scientific literature being produced every year, the interest in this field is massive and is bound to grow in the future.<br />
<br />
== References ==<br />
<br />
Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering, and extraction techniques. arXiv preprint arXiv:1707.02919.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_ODEs&diff=46578Neural ODEs2020-11-26T05:24:39Z<p>A29mukhe: /* References */</p>
<hr />
<div>== Introduction ==<br />
Chen et al. propose a new class of neural networks called neural ordinary differential equations (ODEs) in their 2018 paper under the same title. Neural network models, such as residual or recurrent networks, can be generalized as a set of transformations through hidden states (a.k.a layers) <math>\mathbf{h}</math>, given by the equation <br />
<br />
<div style="text-align:center;"><math> \mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t,\theta_t) </math> (1) </div><br />
<br />
where <math>t \in \{0,...,T\}</math> and <math>\theta_t</math> corresponds to the set of parameters or weights in state <math>t</math>. It is important to note that it has been shown (Lu et al., 2017)(Haber<br />
and Ruthotto, 2017)(Ruthotto and Haber, 2018) that Equation 1 can be viewed as an Euler discretization. Given this Euler description, if the number of layers and step size between layers are taken to their limits, then Equation 1 can instead be described continuously in the form of the ODE, <br />
<br />
<div style="text-align:center;"><math> \frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t),t,\theta) </math> (2). </div><br />
<br />
Equation 2 now describes a network where the output layer <math>\mathbf{h}(T)</math> is generated by solving for the ODE at time <math>T</math>, given the initial value at <math>t=0</math>, where <math>\mathbf{h}(0)</math> is the input layer of the network. <br />
<br />
With a vast amount of theory and research in the field of solving ODEs numerically, there are a number of benefits to formulating the hidden state dynamics this way. One major advantage is that a continuous description of the network allows for the calculation of <math>f</math> at arbitrary intervals and locations. The authors provide an example in section five of how the neural ODE network outperforms the discretized version i.e. residual networks, by taking advantage of the continuity of <math>f</math>. A depiction of this distinction is shown in the figure below. <br />
<br />
<div style="text-align:center;"> [[File:NeuralODEs_Fig1.png|350px]] </div><br />
<br />
In section four the authors show that the single-unit bottleneck of normalizing flows can be overcome by constructing a new class of density models that incorporates the neural ODE network formulation.<br />
The next section on automatic differentiation will describe how utilizing ODE solvers allows for the calculation of gradients of the loss function without storing any of the hidden state information. This results in a very low memory requirement for neural ODE networks in comparison to traditional networks that rely on intermediate hidden state quantities for backpropagation.<br />
<br />
== Reverse-mode Automatic Differentiation of ODE Solutions ==<br />
Like most neural networks, optimizing the weight parameters <math>\theta</math> for a neural ODE network involves finding the gradient of a loss function with respect to those parameters. Differentiating in the forward direction is a simple task, however, this method is very computationally expensive and unstable, as it introduces additional numerical error. Instead, the authors suggest that the gradients can be calculated in the reverse-mode with the adjoint sensitivity method (Pontryagin et al., 1962). This "backpropagation" method solves an augmented version of the forward ODE problem but in reverse, which is something that all ODE solvers are capable of. Section 3 provides results showing that this method gives very desirable memory costs and numerical stability. <br />
<br />
The authors provide an example of the adjoint method by considering the minimization of the scalar-valued loss function <math>L</math>, which takes the solution of the ODE solver as its argument.<br />
<br />
<div style="text-align:center;">[[File:NeuralODEs_Eq1.png|700px]],</div> <br />
This minimization problem requires the calculation of <math>\frac{\partial L}{\partial \mathbf{z}(t_0)}</math> and <math>\frac{\partial L}{\partial \theta}</math>.<br />
<br />
The adjoint itself is defined as <math>\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{z}(t)}</math>, which describes the gradient of the loss with respect to the hidden state <math>\mathbf{z}(t)</math>. By taking the first derivative of the adjoint, another ODE arises in the form of,<br />
<br />
<div style="text-align:center;"><math>\frac{d \mathbf{a}(t)}{dt} = -\mathbf{a}(t)^T \frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \mathbf{z}}</math> (3).</div> <br />
<br />
Since the value <math>\mathbf{a}(t_0)</math> is required to minimize the loss, the ODE in equation 3 must be solved backwards in time from <math>\mathbf{a}(t_1)</math>. Solving this problem is dependent on the knowledge of the hidden state <math>\mathbf{z}(t)</math> for all <math>t</math>, which an neural ODE does not save on the forward pass. Luckily, both <math>\mathbf{a}(t)</math> and <math>\mathbf{z}(t)</math> can be calculated in reverse, at the same time, by setting up an augmented version of the dynamics and is shown in the final algorithm. Finally, the derivative <math>dL/d\theta</math> can be expressed in terms of the adjoint and the hidden state as, <br />
<br />
<div style="text-align:center;"><math> \frac{dL}{d\theta} -\int_{t_1}^{t_0} \mathbf{a}(t)^T\frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \theta}dt</math> (4).</div><br />
<br />
To obtain very inexpensive calculations of <math>\frac{\partial f}{\partial z}</math> and <math>\frac{\partial f}{\partial \theta}</math> in equation 3 and 4, automatic differentiation can be utilized. The authors present an algorithm to calculate the gradients of <math>L</math> and their dependent quantities with only one call to an ODE solver and is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Algorithm1.png|850px]]</div><br />
<br />
If the loss function has a stronger dependence on the hidden states for <math>t \neq t_0,t_1</math>, then Algorithm 1 can be modified to handle multiple calls to the ODESolve step since most ODE solvers have the capability to provide <math>z(t)</math> at arbitrary times. A visual depiction of this scenario is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODES Fig2.png|350px]]</div><br />
<br />
Please see the [https://arxiv.org/pdf/1806.07366.pdf#page=13 appendix] for extended versions of Algorithm 1 and detailed derivations of each equation in this section.<br />
<br />
== Replacing Residual Networks with ODEs for Supervised Learning ==<br />
Section three of the paper investigates an application of the reverse-mode differentiation described in section two, for the training of neural ODE networks on the MNIST digit data set. To solve for the forward pass in the neural ODE network, the following experiment used Adams-Moulton (AM) method, which is an implicit ODE solver. Although it has a marked improvement over explicit ODE solvers in numerical accuracy, integrating backward through the network for backpropagation is still not preferred and the adjoint sensitivity method is used to perform efficient weight optimization. The network with this "backpropagation" technique is referred to as ODE-Net in this section. <br />
<br />
=== Implementation ===<br />
A residual network (ResNet), studied by He et al. (2016), with six standard residual blocks was used as a comparative model for this experiment. The competing model, ODE-net, replaces the residual blocks of the ResNet with the AM solver. As a hybrid of the two models ResNet and ODE-net, a third network was created called RK-Net, which solves the weight optimization of the neural ODE network explicitly through backward Runge-Kutta integration. The following table shows the training and performance results of each network. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Table1.png|400px]]</div><br />
<br />
Note that <math>L</math> and <math>\tilde{L}</math> are the number of layers in ResNet and the number of function calls that the AM method makes for the two ODE networks and are effectively analogous quantities. As shown in Table 1, both of the ODE networks achieve comparable performance to that of the ResNet with a notable decrease in memory cost for ODE-net.<br />
<br />
<br />
Another interesting component of ODE networks is the ability to control the tolerance in the ODE solver used and subsequently the numerical error in the solution. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Fig3.png|700px]]</div><br />
<br />
The tolerance of the ODE solver is represented by the colour bar in Figure 3 above and notice that a variety of effects arise from adjusting this parameter. Primarily, if one was to treat the tolerance as a hyperparameter of sorts, you could tune it such that you find a balance between accuracy (Figure 3a) and computational complexity (Figure 3b). Figure 3c also provides further evidence for the benefits of the adjoint method for the backward pass in ODE-nets since there is a nearly 1:0.5 ratio of forward to backward function calls. In the ResNet and RK-Net examples, this ratio is 1:1.<br />
<br />
Additionally, the authors loosely define the concept of depth in a neural ODE network by referring to Figure 3d. Here it's evident that as you continue to train an ODE network, the number of function evaluations the ODE solver performs increases. As previously mentioned, this quantity is comparable to the network depth of a discretized network. However, as the authors note, this result should be seen as the progression of the network's complexity over training epochs, which is something we expect to increase over time.<br />
<br />
== Continuous Normalizing Flows ==<br />
<br />
Section four tackles the implementation of continuous-depth Neural Networks, but to do so, in the first part of section four the authors discuss theoretically how to establish this kind of network through the use of normalizing flows. The authors use a change of variables method presented in other works (Rezende and Mohamed, 2015), (Dinh et al., 2014), to compute the change of a probability distribution if sample points are transformed through a bijective function, <math>f</math>.<br />
<br />
<div style="text-align:center;"><math>z_1=f(z_0) \Rightarrow \log(p(z_1))=\log(p(z_0))-\log|\det\frac{\partial f}{\partial z_0}|</math></div><br />
<br />
Where p(z) is the probability distribution of the samples and <math>det\frac{\partial f}{\partial z_0}</math> is the determinant of the Jacobian which has a cubic cost in the dimension of '''z''' or the number of hidden units in the network. The authors discovered however that transforming the discrete set of hidden layers in the normalizing flow network to continuous transformations simplifies the computations significantly, due primarily to the following theorem:<br />
<br />
'''''Theorem 1:''' (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability p(z(t)) dependent on time. Let dz/dt=f(z(t),t) be a differential equation describing a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the change in log probability also follows a differential equation:''<br />
<br />
<div style="text-align:center;"><math>\frac{\partial \log(p(z(t)))}{\partial t}=-tr\left(\frac{df}{dz(t)}\right)</math></div><br />
<br />
The biggest advantage to using this theorem is that the trace function is a linear function, so if the dynamics of the problem, f, is represented by a sum of functions, then so is the log density. This essentially means that you can now compute flow models with only a linear cost with respect to the number of hidden units, <math>M</math>. In standard normalizing flow models, the cost is <math>O(M^3)</math>, so they will generally fit many layers with a single hidden unit in each layer.<br />
<br />
Finally the authors use these realizations to construct Continuous Normalizing Flow networks (CNFs) by specifying the parameters of the flow as a function of ''t'', ie, <math>f(z(t),t)</math>. They also use a gating mechanism for each hidden unit, <math>\frac{dz}{dt}=\sum_n \sigma_n(t)f_n(z)</math> where <math>\sigma_n(t)\in (0,1)</math> is a separate neural network which learns when to apply each dynamic <math>f_n</math>.<br />
<br />
===Implementation===<br />
<br />
The authors construct two separate types of neural networks to compare against each other, the first is the standard planar Normalizing Flow network (NF) using 64 layers of single hidden units, and the second is their new CNF with 64 hidden units. The NF model is trained over 500,000 iterations using RMSprop, and the CNF network is trained over 10,000 iterations using Adam(algorithm for first-order gradient-based optimization of stochastic objective functions). The loss function is <math>KL(q(x)||p(x))</math> where <math>q(x)</math> is the flow model and <math>p(x)</math> is the target probability density.<br />
<br />
One of the biggest advantages when implementing CNF is that you can train the flow parameters just by performing maximum likelihood estimation on <math>\log(q(x))</math> given <math>p(x)</math>, where <math>q(x)</math> is found via the theorem above, and then reversing the CNF to generate random samples from <math>q(x)</math>. This reversal of the CNF is done with about the same cost of the forward pass which is not able to be done in an NF network. The following two figures demonstrate the ability of CNF to generate more expressive and accurate output data as compared to standard NF networks.<br />
<br />
<div style="text-align:center;"><br />
[[Image:CNFcomparisons.png]]<br />
<br />
[[Image:CNFtransitions.png]]<br />
</div><br />
<br />
Figure 4 shows clearly that the CNF structure exhibits significantly lower loss functions than NF. In figure 5 both networks were tasked with transforming a standard Gaussian distribution into a target distribution, not only was the CNF network more accurate on the two moons target, but also the steps it took along the way are much more intuitive than the output from NF.<br />
<br />
== A Generative Latent Function Time-Series Model ==<br />
<br />
One of the largest issues at play in terms of Neural ODE networks is the fact that in many instances, data points are either very sparsely distributed, or irregularly-sampled. The latent dynamics are discretized and the observations are in the bins of fixed duration. This creates issues with missing data and ill-defined latent variables. An example of this is medical records which are only updated when a patient visits a doctor or the hospital. To solve this issue the authors had to create a generative time-series model which would be able to fill in the gaps of missing data. The authors consider each time series as a latent trajectory stemming from the initial local state <math>z_{t_0 }</math> and determined from a global set of latent parameters. Given a set of observation times and initial state, the generative model constructs points via the following sample procedure:<br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_0}∼p(z_{t_0}) <br />
</math><br />
</div> <br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_1},z_{t_2},\dots,z_{t_N}={\rm ODESolve}(z_{t_0},f,θ_f,t_0,...,t_N)<br />
</math><br />
</div><br />
<br />
<div style="text-align:center;"><br />
each <br />
<math><br />
x_{t_i}∼p(x│z_{t_i},θ_x)<br />
</math><br />
</div><br />
<br />
<math>f</math> is a function which outputs the gradient <math>\frac{\partial z(t)}{\partial t}=f(z(t),θ_f)</math> which is parameterized via a neural net. In order to train this latent variable model, the authors had to first encode their given data and observation times using an RNN encoder, construct the new points using the trained parameters, then decode the points back into the original space. The following figure describes this process:<br />
<br />
<div style="text-align:center;"><br />
[[Image:EncodingFigure.png]]<br />
</div><br />
<br />
Another variable which could affect the latent state of a time-series model is how often an event actually occurs. The authors solved this by parameterizing the rate of events in terms of a Poisson process. They described the set of independent observation times in an interval <math>\left[t_{start},t_{end}\right]</math> as:<br />
<br />
<div style="text-align:center;"> <br />
<math><br />
{\rm log}(p(t_1,t_2,\dots,t_N ))=\sum_{i=1}^N{\rm log}(\lambda(z(t_i)))-\int_{t_{start}}^{t_{end}}λ(z(t))dt<br />
</math><br />
</div><br />
<br />
where <math>\lambda(*)</math> is parameterized via another neural network.<br />
<br />
===Implementation===<br />
<br />
To test the effectiveness of the Latent time-series ODE model (LODE), they fit the encoder with 25 hidden units, parametrize function f with a one-layer 20 hidden unit network, and the decoder as another neural network with 20 hidden units. They compare this against a standard recurrent neural net (RNN) with 25 hidden units trained to minimize Gaussian log-likelihood. The authors tested both of these network systems on a dataset of 2-dimensional spirals which either rotated clockwise or counter-clockwise and sampled the positions of each spiral at 100 equally spaced time steps. They can then simulate irregularly timed data by taking random amounts of points without replacement from each spiral. The next two figures show the outcome of these experiments:<br />
<br />
<div style="text-align:center;"><br />
[[Image:LODEtestresults.png]] [[Image:SpiralFigure.png|The blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model]]<br />
</div><br />
<br />
In the figure on the right the blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model. It is noted that the LODE performs significantly better than the standard RNN model, especially on smaller sets of data points.<br />
<br />
== Scope and Limitations ==<br />
<br />
Section 6 mainly discusses the scope and limitations of the paper. Firstly, while “batching” the training data is a useful step in standard neural nets, and can still be applied here by combining the ODEs associated with each batch, the authors found that controlling the error, in this case, may increase the number of calculations required. In practice, however, the number of calculations did not increase significantly.<br />
<br />
So long as the model proposed in this paper uses finite weights and Lipschitz nonlinearities, Picard’s existence theorem (Coddington and Levinson, 1955) applies, which guarantees that the solution to the IVP exists and is unique. This theorem holds for the model presented above when the network has finite weights and uses nonlinearities in the Lipshitz class. <br />
<br />
In controlling the amount of error in the model, the authors were only able to reduce tolerances to approximately <math>10^{-3}</math> and <math>10^{-5}</math> in classification and density estimation respectively without also degrading the computational performance.<br />
<br />
The authors believe that reconstructing state trajectories by running the dynamics backwards can introduce extra numerical error. They address a possible solution to this problem by checkpointing certain time steps and storing intermediate values of z on the forward pass. Then while reconstructing, you do each part individually between checkpoints. The authors acknowledged that they informally checked the validity of this method since they don’t consider it a practical problem.<br />
<br />
There remain, however, areas where standard neural networks may perform better than Neural ODEs. Firstly, conventional nets can fit non-homeomorphic functions, for example, functions whose output has a smaller dimension that their input, or that change the topology of the input space. However, this could be handled by composing ODE nets with standard network layers. Another point is that conventional nets can be evaluated exactly with a fixed amount of computation, and are typically faster to train, and do not require an error tolerance for a solver.<br />
<br />
== Conclusions and Critiques ==<br />
<br />
We covered the use of black-box ODE solvers as a model component and their application to initial value problems constructed from real applications. Neural ODE Networks show promising gains in computational cost without large sacrifices in accuracy when applied to certain problems. A drawback of some of these implementations is that the ODE Neural Networks are limited by the underlying distributions of the problems they are trying to solve (requirement of Lipschitz continuity, etc.). There are plenty of further advances to be made in this field as hundreds of years of ODE theory and literature is available, so this is currently an important area of research.<br />
<br />
ODEs indeed represent an important area of applied mathematics where neural networks can be used to solve them numerically. Perhaps, a parallel area of investigation can be PDEs (Partial Differential Equations). PDEs are also widely encountered in many areas of applied mathematics, physics, social sciences and many other fields. It will be interesting to see how neural networks can be used to solve PDEs.<br />
<br />
== More Critiques ==<br />
<br />
This paper covers the memory efficiency of Neural ODE Networks, but does not address runtime. In practice, most systems are bound by latency requirements more-so than memory requirements (except in edge device cases). Though it may be unreasonable to expect the authors to produce a performance-optimized implementation, it would be insightful to understand the computational bottlenecks so existing frameworks can take steps to address them. This model looks promising and practical performance is the key to enabling future research in this.<br />
<br />
The above critique also questions the need for a neural network for such a problem. This problem was studied by Brunel et al. and they presented their solution in their paper ''Parametric Estimation of Ordinary Differential Equations with Orthogonality Conditions''. While this solution also requires iteratively solving a complex optimization problem, they did not require the massive memory and runtime overhead of a neural network. For the neural network solution to demonstrate its potential, it should be including experimental comparisons with specialized ordinary differential equation algorithms instead of simply comparing with a general recurrent neural network.<br />
<br />
== References ==<br />
Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. ''arXiv preprint arXiv'':1710.10121, 2017.<br />
<br />
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. ''Inverse Problems'', 34 (1):014004, 2017.<br />
<br />
Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. ''arXiv preprint arXiv'':1804.04272, 2018.<br />
<br />
Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. ''The mathematical theory of optimal processes''. 1962.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ''European conference on computer vision'', pages 630–645. Springer, 2016b.<br />
<br />
Earl A Coddington and Norman Levinson. ''Theory of ordinary differential equations''. Tata McGrawHill Education, 1955.<br />
<br />
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. ''arXiv preprint arXiv:1505.05770'', 2015.<br />
<br />
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. ''arXiv preprint arXiv:1410.8516'', 2014.<br />
<br />
Brunel, N. J., Clairon, Q., & d’Alché-Buc, F. (2014). Parametric estimation of ordinary differential equations with orthogonality conditions. ''Journal of the American Statistical Association'', 109(505), 173-185.<br />
<br />
A. Iserles. A first course in the numerical analysis of differential equations. Cambridge Texts in Applied Mathematics, second edition, 2009</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_ODEs&diff=46574Neural ODEs2020-11-26T05:21:33Z<p>A29mukhe: /* Replacing Residual Networks with ODEs for Supervised Learning */</p>
<hr />
<div>== Introduction ==<br />
Chen et al. propose a new class of neural networks called neural ordinary differential equations (ODEs) in their 2018 paper under the same title. Neural network models, such as residual or recurrent networks, can be generalized as a set of transformations through hidden states (a.k.a layers) <math>\mathbf{h}</math>, given by the equation <br />
<br />
<div style="text-align:center;"><math> \mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t,\theta_t) </math> (1) </div><br />
<br />
where <math>t \in \{0,...,T\}</math> and <math>\theta_t</math> corresponds to the set of parameters or weights in state <math>t</math>. It is important to note that it has been shown (Lu et al., 2017)(Haber<br />
and Ruthotto, 2017)(Ruthotto and Haber, 2018) that Equation 1 can be viewed as an Euler discretization. Given this Euler description, if the number of layers and step size between layers are taken to their limits, then Equation 1 can instead be described continuously in the form of the ODE, <br />
<br />
<div style="text-align:center;"><math> \frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t),t,\theta) </math> (2). </div><br />
<br />
Equation 2 now describes a network where the output layer <math>\mathbf{h}(T)</math> is generated by solving for the ODE at time <math>T</math>, given the initial value at <math>t=0</math>, where <math>\mathbf{h}(0)</math> is the input layer of the network. <br />
<br />
With a vast amount of theory and research in the field of solving ODEs numerically, there are a number of benefits to formulating the hidden state dynamics this way. One major advantage is that a continuous description of the network allows for the calculation of <math>f</math> at arbitrary intervals and locations. The authors provide an example in section five of how the neural ODE network outperforms the discretized version i.e. residual networks, by taking advantage of the continuity of <math>f</math>. A depiction of this distinction is shown in the figure below. <br />
<br />
<div style="text-align:center;"> [[File:NeuralODEs_Fig1.png|350px]] </div><br />
<br />
In section four the authors show that the single-unit bottleneck of normalizing flows can be overcome by constructing a new class of density models that incorporates the neural ODE network formulation.<br />
The next section on automatic differentiation will describe how utilizing ODE solvers allows for the calculation of gradients of the loss function without storing any of the hidden state information. This results in a very low memory requirement for neural ODE networks in comparison to traditional networks that rely on intermediate hidden state quantities for backpropagation.<br />
<br />
== Reverse-mode Automatic Differentiation of ODE Solutions ==<br />
Like most neural networks, optimizing the weight parameters <math>\theta</math> for a neural ODE network involves finding the gradient of a loss function with respect to those parameters. Differentiating in the forward direction is a simple task, however, this method is very computationally expensive and unstable, as it introduces additional numerical error. Instead, the authors suggest that the gradients can be calculated in the reverse-mode with the adjoint sensitivity method (Pontryagin et al., 1962). This "backpropagation" method solves an augmented version of the forward ODE problem but in reverse, which is something that all ODE solvers are capable of. Section 3 provides results showing that this method gives very desirable memory costs and numerical stability. <br />
<br />
The authors provide an example of the adjoint method by considering the minimization of the scalar-valued loss function <math>L</math>, which takes the solution of the ODE solver as its argument.<br />
<br />
<div style="text-align:center;">[[File:NeuralODEs_Eq1.png|700px]],</div> <br />
This minimization problem requires the calculation of <math>\frac{\partial L}{\partial \mathbf{z}(t_0)}</math> and <math>\frac{\partial L}{\partial \theta}</math>.<br />
<br />
The adjoint itself is defined as <math>\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{z}(t)}</math>, which describes the gradient of the loss with respect to the hidden state <math>\mathbf{z}(t)</math>. By taking the first derivative of the adjoint, another ODE arises in the form of,<br />
<br />
<div style="text-align:center;"><math>\frac{d \mathbf{a}(t)}{dt} = -\mathbf{a}(t)^T \frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \mathbf{z}}</math> (3).</div> <br />
<br />
Since the value <math>\mathbf{a}(t_0)</math> is required to minimize the loss, the ODE in equation 3 must be solved backwards in time from <math>\mathbf{a}(t_1)</math>. Solving this problem is dependent on the knowledge of the hidden state <math>\mathbf{z}(t)</math> for all <math>t</math>, which an neural ODE does not save on the forward pass. Luckily, both <math>\mathbf{a}(t)</math> and <math>\mathbf{z}(t)</math> can be calculated in reverse, at the same time, by setting up an augmented version of the dynamics and is shown in the final algorithm. Finally, the derivative <math>dL/d\theta</math> can be expressed in terms of the adjoint and the hidden state as, <br />
<br />
<div style="text-align:center;"><math> \frac{dL}{d\theta} -\int_{t_1}^{t_0} \mathbf{a}(t)^T\frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \theta}dt</math> (4).</div><br />
<br />
To obtain very inexpensive calculations of <math>\frac{\partial f}{\partial z}</math> and <math>\frac{\partial f}{\partial \theta}</math> in equation 3 and 4, automatic differentiation can be utilized. The authors present an algorithm to calculate the gradients of <math>L</math> and their dependent quantities with only one call to an ODE solver and is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Algorithm1.png|850px]]</div><br />
<br />
If the loss function has a stronger dependence on the hidden states for <math>t \neq t_0,t_1</math>, then Algorithm 1 can be modified to handle multiple calls to the ODESolve step since most ODE solvers have the capability to provide <math>z(t)</math> at arbitrary times. A visual depiction of this scenario is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODES Fig2.png|350px]]</div><br />
<br />
Please see the [https://arxiv.org/pdf/1806.07366.pdf#page=13 appendix] for extended versions of Algorithm 1 and detailed derivations of each equation in this section.<br />
<br />
== Replacing Residual Networks with ODEs for Supervised Learning ==<br />
Section three of the paper investigates an application of the reverse-mode differentiation described in section two, for the training of neural ODE networks on the MNIST digit data set. To solve for the forward pass in the neural ODE network, the following experiment used Adams-Moulton (AM) method, which is an implicit ODE solver. Although it has a marked improvement over explicit ODE solvers in numerical accuracy, integrating backward through the network for backpropagation is still not preferred and the adjoint sensitivity method is used to perform efficient weight optimization. The network with this "backpropagation" technique is referred to as ODE-Net in this section. <br />
<br />
=== Implementation ===<br />
A residual network (ResNet), studied by He et al. (2016), with six standard residual blocks was used as a comparative model for this experiment. The competing model, ODE-net, replaces the residual blocks of the ResNet with the AM solver. As a hybrid of the two models ResNet and ODE-net, a third network was created called RK-Net, which solves the weight optimization of the neural ODE network explicitly through backward Runge-Kutta integration. The following table shows the training and performance results of each network. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Table1.png|400px]]</div><br />
<br />
Note that <math>L</math> and <math>\tilde{L}</math> are the number of layers in ResNet and the number of function calls that the AM method makes for the two ODE networks and are effectively analogous quantities. As shown in Table 1, both of the ODE networks achieve comparable performance to that of the ResNet with a notable decrease in memory cost for ODE-net.<br />
<br />
<br />
Another interesting component of ODE networks is the ability to control the tolerance in the ODE solver used and subsequently the numerical error in the solution. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Fig3.png|700px]]</div><br />
<br />
The tolerance of the ODE solver is represented by the colour bar in Figure 3 above and notice that a variety of effects arise from adjusting this parameter. Primarily, if one was to treat the tolerance as a hyperparameter of sorts, you could tune it such that you find a balance between accuracy (Figure 3a) and computational complexity (Figure 3b). Figure 3c also provides further evidence for the benefits of the adjoint method for the backward pass in ODE-nets since there is a nearly 1:0.5 ratio of forward to backward function calls. In the ResNet and RK-Net examples, this ratio is 1:1.<br />
<br />
Additionally, the authors loosely define the concept of depth in a neural ODE network by referring to Figure 3d. Here it's evident that as you continue to train an ODE network, the number of function evaluations the ODE solver performs increases. As previously mentioned, this quantity is comparable to the network depth of a discretized network. However, as the authors note, this result should be seen as the progression of the network's complexity over training epochs, which is something we expect to increase over time.<br />
<br />
== Continuous Normalizing Flows ==<br />
<br />
Section four tackles the implementation of continuous-depth Neural Networks, but to do so, in the first part of section four the authors discuss theoretically how to establish this kind of network through the use of normalizing flows. The authors use a change of variables method presented in other works (Rezende and Mohamed, 2015), (Dinh et al., 2014), to compute the change of a probability distribution if sample points are transformed through a bijective function, <math>f</math>.<br />
<br />
<div style="text-align:center;"><math>z_1=f(z_0) \Rightarrow \log(p(z_1))=\log(p(z_0))-\log|\det\frac{\partial f}{\partial z_0}|</math></div><br />
<br />
Where p(z) is the probability distribution of the samples and <math>det\frac{\partial f}{\partial z_0}</math> is the determinant of the Jacobian which has a cubic cost in the dimension of '''z''' or the number of hidden units in the network. The authors discovered however that transforming the discrete set of hidden layers in the normalizing flow network to continuous transformations simplifies the computations significantly, due primarily to the following theorem:<br />
<br />
'''''Theorem 1:''' (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability p(z(t)) dependent on time. Let dz/dt=f(z(t),t) be a differential equation describing a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the change in log probability also follows a differential equation:''<br />
<br />
<div style="text-align:center;"><math>\frac{\partial \log(p(z(t)))}{\partial t}=-tr\left(\frac{df}{dz(t)}\right)</math></div><br />
<br />
The biggest advantage to using this theorem is that the trace function is a linear function, so if the dynamics of the problem, f, is represented by a sum of functions, then so is the log density. This essentially means that you can now compute flow models with only a linear cost with respect to the number of hidden units, <math>M</math>. In standard normalizing flow models, the cost is <math>O(M^3)</math>, so they will generally fit many layers with a single hidden unit in each layer.<br />
<br />
Finally the authors use these realizations to construct Continuous Normalizing Flow networks (CNFs) by specifying the parameters of the flow as a function of ''t'', ie, <math>f(z(t),t)</math>. They also use a gating mechanism for each hidden unit, <math>\frac{dz}{dt}=\sum_n \sigma_n(t)f_n(z)</math> where <math>\sigma_n(t)\in (0,1)</math> is a separate neural network which learns when to apply each dynamic <math>f_n</math>.<br />
<br />
===Implementation===<br />
<br />
The authors construct two separate types of neural networks to compare against each other, the first is the standard planar Normalizing Flow network (NF) using 64 layers of single hidden units, and the second is their new CNF with 64 hidden units. The NF model is trained over 500,000 iterations using RMSprop, and the CNF network is trained over 10,000 iterations using Adam(algorithm for first-order gradient-based optimization of stochastic objective functions). The loss function is <math>KL(q(x)||p(x))</math> where <math>q(x)</math> is the flow model and <math>p(x)</math> is the target probability density.<br />
<br />
One of the biggest advantages when implementing CNF is that you can train the flow parameters just by performing maximum likelihood estimation on <math>\log(q(x))</math> given <math>p(x)</math>, where <math>q(x)</math> is found via the theorem above, and then reversing the CNF to generate random samples from <math>q(x)</math>. This reversal of the CNF is done with about the same cost of the forward pass which is not able to be done in an NF network. The following two figures demonstrate the ability of CNF to generate more expressive and accurate output data as compared to standard NF networks.<br />
<br />
<div style="text-align:center;"><br />
[[Image:CNFcomparisons.png]]<br />
<br />
[[Image:CNFtransitions.png]]<br />
</div><br />
<br />
Figure 4 shows clearly that the CNF structure exhibits significantly lower loss functions than NF. In figure 5 both networks were tasked with transforming a standard Gaussian distribution into a target distribution, not only was the CNF network more accurate on the two moons target, but also the steps it took along the way are much more intuitive than the output from NF.<br />
<br />
== A Generative Latent Function Time-Series Model ==<br />
<br />
One of the largest issues at play in terms of Neural ODE networks is the fact that in many instances, data points are either very sparsely distributed, or irregularly-sampled. The latent dynamics are discretized and the observations are in the bins of fixed duration. This creates issues with missing data and ill-defined latent variables. An example of this is medical records which are only updated when a patient visits a doctor or the hospital. To solve this issue the authors had to create a generative time-series model which would be able to fill in the gaps of missing data. The authors consider each time series as a latent trajectory stemming from the initial local state <math>z_{t_0 }</math> and determined from a global set of latent parameters. Given a set of observation times and initial state, the generative model constructs points via the following sample procedure:<br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_0}∼p(z_{t_0}) <br />
</math><br />
</div> <br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_1},z_{t_2},\dots,z_{t_N}={\rm ODESolve}(z_{t_0},f,θ_f,t_0,...,t_N)<br />
</math><br />
</div><br />
<br />
<div style="text-align:center;"><br />
each <br />
<math><br />
x_{t_i}∼p(x│z_{t_i},θ_x)<br />
</math><br />
</div><br />
<br />
<math>f</math> is a function which outputs the gradient <math>\frac{\partial z(t)}{\partial t}=f(z(t),θ_f)</math> which is parameterized via a neural net. In order to train this latent variable model, the authors had to first encode their given data and observation times using an RNN encoder, construct the new points using the trained parameters, then decode the points back into the original space. The following figure describes this process:<br />
<br />
<div style="text-align:center;"><br />
[[Image:EncodingFigure.png]]<br />
</div><br />
<br />
Another variable which could affect the latent state of a time-series model is how often an event actually occurs. The authors solved this by parameterizing the rate of events in terms of a Poisson process. They described the set of independent observation times in an interval <math>\left[t_{start},t_{end}\right]</math> as:<br />
<br />
<div style="text-align:center;"> <br />
<math><br />
{\rm log}(p(t_1,t_2,\dots,t_N ))=\sum_{i=1}^N{\rm log}(\lambda(z(t_i)))-\int_{t_{start}}^{t_{end}}λ(z(t))dt<br />
</math><br />
</div><br />
<br />
where <math>\lambda(*)</math> is parameterized via another neural network.<br />
<br />
===Implementation===<br />
<br />
To test the effectiveness of the Latent time-series ODE model (LODE), they fit the encoder with 25 hidden units, parametrize function f with a one-layer 20 hidden unit network, and the decoder as another neural network with 20 hidden units. They compare this against a standard recurrent neural net (RNN) with 25 hidden units trained to minimize Gaussian log-likelihood. The authors tested both of these network systems on a dataset of 2-dimensional spirals which either rotated clockwise or counter-clockwise and sampled the positions of each spiral at 100 equally spaced time steps. They can then simulate irregularly timed data by taking random amounts of points without replacement from each spiral. The next two figures show the outcome of these experiments:<br />
<br />
<div style="text-align:center;"><br />
[[Image:LODEtestresults.png]] [[Image:SpiralFigure.png|The blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model]]<br />
</div><br />
<br />
In the figure on the right the blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model. It is noted that the LODE performs significantly better than the standard RNN model, especially on smaller sets of data points.<br />
<br />
== Scope and Limitations ==<br />
<br />
Section 6 mainly discusses the scope and limitations of the paper. Firstly, while “batching” the training data is a useful step in standard neural nets, and can still be applied here by combining the ODEs associated with each batch, the authors found that controlling the error, in this case, may increase the number of calculations required. In practice, however, the number of calculations did not increase significantly.<br />
<br />
So long as the model proposed in this paper uses finite weights and Lipschitz nonlinearities, Picard’s existence theorem (Coddington and Levinson, 1955) applies, which guarantees that the solution to the IVP exists and is unique. This theorem holds for the model presented above when the network has finite weights and uses nonlinearities in the Lipshitz class. <br />
<br />
In controlling the amount of error in the model, the authors were only able to reduce tolerances to approximately <math>10^{-3}</math> and <math>10^{-5}</math> in classification and density estimation respectively without also degrading the computational performance.<br />
<br />
The authors believe that reconstructing state trajectories by running the dynamics backwards can introduce extra numerical error. They address a possible solution to this problem by checkpointing certain time steps and storing intermediate values of z on the forward pass. Then while reconstructing, you do each part individually between checkpoints. The authors acknowledged that they informally checked the validity of this method since they don’t consider it a practical problem.<br />
<br />
There remain, however, areas where standard neural networks may perform better than Neural ODEs. Firstly, conventional nets can fit non-homeomorphic functions, for example, functions whose output has a smaller dimension that their input, or that change the topology of the input space. However, this could be handled by composing ODE nets with standard network layers. Another point is that conventional nets can be evaluated exactly with a fixed amount of computation, and are typically faster to train, and do not require an error tolerance for a solver.<br />
<br />
== Conclusions and Critiques ==<br />
<br />
We covered the use of black-box ODE solvers as a model component and their application to initial value problems constructed from real applications. Neural ODE Networks show promising gains in computational cost without large sacrifices in accuracy when applied to certain problems. A drawback of some of these implementations is that the ODE Neural Networks are limited by the underlying distributions of the problems they are trying to solve (requirement of Lipschitz continuity, etc.). There are plenty of further advances to be made in this field as hundreds of years of ODE theory and literature is available, so this is currently an important area of research.<br />
<br />
ODEs indeed represent an important area of applied mathematics where neural networks can be used to solve them numerically. Perhaps, a parallel area of investigation can be PDEs (Partial Differential Equations). PDEs are also widely encountered in many areas of applied mathematics, physics, social sciences and many other fields. It will be interesting to see how neural networks can be used to solve PDEs.<br />
<br />
== More Critiques ==<br />
<br />
This paper covers the memory efficiency of Neural ODE Networks, but does not address runtime. In practice, most systems are bound by latency requirements more-so than memory requirements (except in edge device cases). Though it may be unreasonable to expect the authors to produce a performance-optimized implementation, it would be insightful to understand the computational bottlenecks so existing frameworks can take steps to address them. This model looks promising and practical performance is the key to enabling future research in this.<br />
<br />
The above critique also questions the need for a neural network for such a problem. This problem was studied by Brunel et al. and they presented their solution in their paper ''Parametric Estimation of Ordinary Differential Equations with Orthogonality Conditions''. While this solution also requires iteratively solving a complex optimization problem, they did not require the massive memory and runtime overhead of a neural network. For the neural network solution to demonstrate its potential, it should be including experimental comparisons with specialized ordinary differential equation algorithms instead of simply comparing with a general recurrent neural network.<br />
<br />
== References ==<br />
Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. ''arXiv preprint arXiv'':1710.10121, 2017.<br />
<br />
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. ''Inverse Problems'', 34 (1):014004, 2017.<br />
<br />
Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. ''arXiv preprint arXiv'':1804.04272, 2018.<br />
<br />
Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. ''The mathematical theory of optimal processes''. 1962.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ''European conference on computer vision'', pages 630–645. Springer, 2016b.<br />
<br />
Earl A Coddington and Norman Levinson. ''Theory of ordinary differential equations''. Tata McGrawHill Education, 1955.<br />
<br />
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. ''arXiv preprint arXiv:1505.05770'', 2015.<br />
<br />
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. ''arXiv preprint arXiv:1410.8516'', 2014.<br />
<br />
Brunel, N. J., Clairon, Q., & d’Alché-Buc, F. (2014). Parametric estimation of ordinary differential equations with orthogonality conditions. ''Journal of the American Statistical Association'', 109(505), 173-185.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=44434Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-15T09:02:36Z<p>A29mukhe: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to the innovation of healthcare, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
[[File:architecture.png | thumb | center | 1000px | Model Architecture (Gupta et al., 2019)]]<br />
<br />
==Result== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a the neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors effecting model performance evaluation: 1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records effecting the training accuracy, they used 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result lower accuracy evaluation, however, it still maintain an high average of 97.83%. <br />
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to larger labeled data set is needed. The PTB database is small so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=44433Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-15T09:02:20Z<p>A29mukhe: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to the innovation of healthcare, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
[[File:architecture.png | thumb | center | 800px | Model Architecture (Gupta et al., 2019)]]<br />
<br />
==Result== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a the neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors effecting model performance evaluation: 1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records effecting the training accuracy, they used 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result lower accuracy evaluation, however, it still maintain an high average of 97.83%. <br />
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to larger labeled data set is needed. The PTB database is small so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=44432Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-15T09:00:14Z<p>A29mukhe: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to the innovation of healthcare, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
[[File:architecture.png | thumb | center | 800px | e]]<br />
<br />
==Result== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a the neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors effecting model performance evaluation: 1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records effecting the training accuracy, they used 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result lower accuracy evaluation, however, it still maintain an high average of 97.83%. <br />
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to larger labeled data set is needed. The PTB database is small so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=44431Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-15T09:00:07Z<p>A29mukhe: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to the innovation of healthcare, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
[[File:architecture.png | thumb | center | 400px | e]]<br />
<br />
==Result== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a the neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors effecting model performance evaluation: 1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records effecting the training accuracy, they used 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result lower accuracy evaluation, however, it still maintain an high average of 97.83%. <br />
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to larger labeled data set is needed. The PTB database is small so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=44430Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-15T08:59:57Z<p>A29mukhe: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to the innovation of healthcare, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
[[File:architecture.png | thumb | center | 50px | e]]<br />
<br />
==Result== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a the neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors effecting model performance evaluation: 1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records effecting the training accuracy, they used 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result lower accuracy evaluation, however, it still maintain an high average of 97.83%. <br />
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to larger labeled data set is needed. The PTB database is small so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=44429Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-15T08:59:10Z<p>A29mukhe: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to the innovation of healthcare, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
[[File:architecture.png | frame | center | 50px | e]]<br />
<br />
==Result== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a the neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors effecting model performance evaluation: 1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records effecting the training accuracy, they used 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result lower accuracy evaluation, however, it still maintain an high average of 97.83%. <br />
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to larger labeled data set is needed. The PTB database is small so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=44428Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-15T08:59:02Z<p>A29mukhe: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to the innovation of healthcare, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
[[File:architecture.png | frame | center | 100px | e]]<br />
<br />
==Result== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a the neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors effecting model performance evaluation: 1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records effecting the training accuracy, they used 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result lower accuracy evaluation, however, it still maintain an high average of 97.83%. <br />
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to larger labeled data set is needed. The PTB database is small so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=44427Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-15T08:58:51Z<p>A29mukhe: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to the innovation of healthcare, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
[[File:architecture.png | frame | center | 200px | e]]<br />
<br />
==Result== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a the neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors effecting model performance evaluation: 1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records effecting the training accuracy, they used 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result lower accuracy evaluation, however, it still maintain an high average of 97.83%. <br />
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to larger labeled data set is needed. The PTB database is small so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=44426Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-15T08:56:48Z<p>A29mukhe: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to the innovation of healthcare, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
[[File:architecture.png]]<br />
<br />
==Result== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a the neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors effecting model performance evaluation: 1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records effecting the training accuracy, they used 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result lower accuracy evaluation, however, it still maintain an high average of 97.83%. <br />
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to larger labeled data set is needed. The PTB database is small so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=44425Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-15T08:56:36Z<p>A29mukhe: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to the innovation of healthcare, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
[[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:architecture.png]]<br />
[[File:architecture.png]]<br />
<br />
==Result== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a the neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors effecting model performance evaluation: 1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records effecting the training accuracy, they used 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result lower accuracy evaluation, however, it still maintain an high average of 97.83%. <br />
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to larger labeled data set is needed. The PTB database is small so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=44424Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-15T08:56:07Z<p>A29mukhe: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to the innovation of healthcare, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
[[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:architecture.png]]<br />
<br />
==Result== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a the neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors effecting model performance evaluation: 1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records effecting the training accuracy, they used 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result lower accuracy evaluation, however, it still maintain an high average of 97.83%. <br />
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to larger labeled data set is needed. The PTB database is small so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:architecture.png&diff=44423File:architecture.png2020-11-15T08:54:21Z<p>A29mukhe: </p>
<hr />
<div></div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:https---raw.githubusercontent.com-arjung128-mi_detection-master-architecture.png&diff=44422File:https---raw.githubusercontent.com-arjung128-mi detection-master-architecture.png2020-11-15T08:53:01Z<p>A29mukhe: </p>
<hr />
<div></div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=44421Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-15T08:52:07Z<p>A29mukhe: /* Model Architecture */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to the innovation of healthcare, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
[[File:https://raw.githubusercontent.com/arjung128/mi_detection/master/architecture.png]]<br />
<br />
==Result== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a the neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors effecting model performance evaluation: 1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records effecting the training accuracy, they used 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result lower accuracy evaluation, however, it still maintain an high average of 97.83%. <br />
<br />
==Discussion & Conclusion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to larger labeled data set is needed. The PTB database is small so it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=43286Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-04T00:00:36Z<p>A29mukhe: Added Model Architecture</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Zhao, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to the detection of heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake, in the area of scientific machine learning. A deep learning approach was used due to the model’s ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbours, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo, and Attarodi [30] used various techniques such as ANN, K-nearest neighbours, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [32] used SVM and K-nearest neighbours as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [23] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set and test set with ratios 80-10-10.<br />
<br />
==Result== <br />
<br />
1. Quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii<br />
<br />
2. Quantification of accuracies for pairs of top 5 highest individual channels with 20-fold cross-validation, resulting highest pairs accuracies to fed into a the neural network: lead v6 and lead vz <br />
<br />
3. Use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.<br />
<br />
4. Discussing 2 factors effecting model performance evaluation:<br />
<br />
1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion <br />
<br />
2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split <br />
<br />
5. Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision<br />
<br />
6. Further using 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz as record-wise split <br />
<br />
1） Discuss patient-wise split might result lower accuracy evaluation, however, it still maintain an average of 97.83%<br />
<br />
==Muyuan==<br />
<br />
Discussion <br />
<br />
1.The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. <br />
<br />
2.This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz and ii contain the most meaningful information for detecting myocardial infraction.<br />
<br />
3.This study also finds that recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate.<br />
<br />
4.To further improve the performance of the models, access to larger labelled data set is needed.<br />
<br />
5.The PTB database is small. It is difficult to test the true robustness of the model with a relatively small test set.<br />
<br />
6.If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
Conclusion<br />
<br />
1.A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches cardiologist-level success rate.<br />
<br />
2.Deep learning algorithms may be readily used as commodity software. Neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction.<br />
<br />
3.Deep learning does not require feature engineering of ECG data to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. <br />
<br />
4.Access to larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and cardiology community can work together to develop deep learning algorithms that provides trustworthy, real-time information regarding heart conditions with minimal computational resources.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=42829Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-10-26T00:24:59Z<p>A29mukhe: </p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Zhao, Amartya (Marty) Mukherjee<br />
<br />
== Maggie ==<br />
<br />
Introduction<br />
<br />
Problem: To detect risk of heart disease from ECG signals.<br />
<br />
Importance: To provide a correct diagnosis so that proper health care can be given to patients.<br />
<br />
Area: Deep learning – scientific machine learning<br />
<br />
o Concerned about design, training, and use of ML algorithm in an optimal matter towards a certain problem.<br />
<br />
Deep learning model benefits:<br />
<br />
o Uses multiple GPUs to construct complicated NN that can be trained using TB-sized datasets, which are robust against noise.<br />
<br />
o After training is complete, it takes little computational power to conduct statistical inference.<br />
<br />
Question that need to be answered:<br />
<br />
o What AI architecture and datasets can provide evidence of symptoms, ECG changes, imaging evidence that shows loss of viable myocardium and wall motion abnormality?<br />
<br />
Purpose of the article:<br />
<br />
o To provide a detailed analysis on the contribution of each ECG lead in identifying heart disease in the model. This is because the selection of data in previous studies regarding heart disease made data selection arbitrary in identifying heart conditions.<br />
<br />
o To show the use of using multiple data channels of information to enhance prediction accuracy in deep learning, i.e. processing the top three ECG leads simultaneously in the neural network.<br />
<br />
o Show that feature engineering is not necessary in the training, validation, or testing process for ECG data in neural networks.<br />
<br />
Related work<br />
<br />
Database used: PTB database<br />
<br />
1. CNN Network<br />
<br />
a. Used both noisy and denoised ECGs without feature engineering.<br />
<br />
2. Used artificial neural network, probabilistic neural network, KNN, multi-layer perceptron, and Naïve Bayes Classification<br />
<br />
a. Extracted two features: T-wave integral and total integral to identify heart disease.<br />
<br />
3. Developed two different ANN: RBF and MLP<br />
<br />
4. Supervised learning techniques have limited success to the problem and used multiple instance learning.<br />
<br />
a. Demonstrate proposed algorithm LTMIL surpasses supervised approaches.<br />
<br />
5. Create new feature by approximating ECG signal using a 20th order polynomial, which achieved 94.4% accuracy.<br />
<br />
6. Stationary wavelet transforms to decompose ECG into sub-bands.<br />
<br />
a. SVM and KNN used to classify.<br />
<br />
b. Features used: sample entropy, normalized sub-bands, log energy entropy, median slope from sub-bands.<br />
<br />
7. Transfer learning – used deep CNN model for arrhythmia and developed it to detect heart disease.<br />
<br />
8. Simple adaptive threshold (SAT)<br />
<br />
a. Multiresolution approach, adaptive thresholding used to extract features: depth of Q peak and elevation in ST segment<br />
<br />
9. Subject-oriented approach using CNN to take in leads II, III, AVF<br />
<br />
10. Model ECG using 2nd order ODE and feed the best-fitting coefficients of the ECG signal into a SVM.<br />
<br />
11. Multi-channel CNN (16 layers) with long-short memory units<br />
<br />
12. Deep CNN that takes 3 seconds at a time of lead II as input<br />
<br />
Most use feature extraction/selection from the raw ECG data before training.<br />
<br />
o Problem with feature selection is that it is not practical for large volumes of data.<br />
<br />
Other papers that do not use feature selection arbitrarily picks ECG leads for classification and does not provide rationale.<br />
<br />
==Betty==<br />
<br />
Result: <br />
<br />
1. Quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii<br />
<br />
2. Quantification of accuracies for pairs of top 5 highest individual channels with 20-fold cross-validation, resulting highest pairs accuracies to fed into a the neural network: lead v6 and lead vz <br />
<br />
3. Use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.<br />
<br />
4. Discussing 2 factors effecting model performance evaluation:<br />
<br />
1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion <br />
<br />
2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split <br />
<br />
5. Comparing with other models in other 12 papers, the model in this article has the highest accuracy, specificity, and precision<br />
<br />
6. Further using 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz as record-wise split <br />
<br />
1） Discuss patient-wise split might result lower accuracy evaluation, however, it still maintain an average of 97.83%<br />
<br />
== Marty ==<br />
<br />
3.1: Data curation<br />
<br />
Dataset: 549 ECG records total<br />
<br />
290 unique patients<br />
<br />
Each ECG record has a mean length of over 100s<br />
<br />
3.2: ANN model<br />
<br />
ConvNetQuake model + 1D batch normalization + Label-smoothing<br />
<br />
Model (PyTorch):<br />
<br />
- Input layer: 10-second long ECG signal<br />
<br />
- Hidden layers: 8 * (1D convolution layer, Activation function: RELU, 1D batch normalization layer)<br />
<br />
- Output layer: 1280 dimensions -> 1 dimension, Activation function: Sigmoid<br />
<br />
Batch size = 10<br />
<br />
Learning rate = 10^-4<br />
<br />
Optimizer = ADAM<br />
<br />
80-10-10: Train-Validation-Test</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=42822Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-10-26T00:19:42Z<p>A29mukhe: </p>
<hr />
<div><br />
== Maggie ==<br />
<br />
Introduction<br />
<br />
Problem: To detect risk of heart disease from ECG signals.<br />
<br />
Importance: To provide a correct diagnosis so that proper health care can be given to patients.<br />
<br />
Area: Deep learning – scientific machine learning<br />
<br />
o Concerned about design, training, and use of ML algorithm in an optimal matter towards a certain problem.<br />
<br />
Deep learning model benefits:<br />
<br />
o Uses multiple GPUs to construct complicated NN that can be trained using TB-sized datasets, which are robust against noise.<br />
<br />
o After training is complete, it takes little computational power to conduct statistical inference.<br />
<br />
Question that need to be answered:<br />
<br />
o What AI architecture and datasets can provide evidence of symptoms, ECG changes, imaging evidence that shows loss of viable myocardium and wall motion abnormality?<br />
<br />
Purpose of the article:<br />
<br />
o To provide a detailed analysis on the contribution of each ECG lead in identifying heart disease in the model. This is because the selection of data in previous studies regarding heart disease made data selection arbitrary in identifying heart conditions.<br />
<br />
o To show the use of using multiple data channels of information to enhance prediction accuracy in deep learning, i.e. processing the top three ECG leads simultaneously in the neural network.<br />
<br />
o Show that feature engineering is not necessary in the training, validation, or testing process for ECG data in neural networks.<br />
<br />
Related work<br />
<br />
Database used: PTB database<br />
<br />
1. CNN Network<br />
<br />
a. Used both noisy and denoised ECGs without feature engineering.<br />
<br />
2. Used artificial neural network, probabilistic neural network, KNN, multi-layer perceptron, and Naïve Bayes Classification<br />
<br />
a. Extracted two features: T-wave integral and total integral to identify heart disease.<br />
<br />
3. Developed two different ANN: RBF and MLP<br />
<br />
4. Supervised learning techniques have limited success to the problem and used multiple instance learning.<br />
<br />
a. Demonstrate proposed algorithm LTMIL surpasses supervised approaches.<br />
<br />
5. Create new feature by approximating ECG signal using a 20th order polynomial, which achieved 94.4% accuracy.<br />
<br />
6. Stationary wavelet transforms to decompose ECG into sub-bands.<br />
<br />
a. SVM and KNN used to classify.<br />
<br />
b. Features used: sample entropy, normalized sub-bands, log energy entropy, median slope from sub-bands.<br />
<br />
7. Transfer learning – used deep CNN model for arrhythmia and developed it to detect heart disease.<br />
<br />
8. Simple adaptive threshold (SAT)<br />
<br />
a. Multiresolution approach, adaptive thresholding used to extract features: depth of Q peak and elevation in ST segment<br />
<br />
9. Subject-oriented approach using CNN to take in leads II, III, AVF<br />
<br />
10. Model ECG using 2nd order ODE and feed the best-fitting coefficients of the ECG signal into a SVM.<br />
<br />
11. Multi-channel CNN (16 layers) with long-short memory units<br />
<br />
12. Deep CNN that takes 3 seconds at a time of lead II as input<br />
<br />
Most use feature extraction/selection from the raw ECG data before training.<br />
<br />
o Problem with feature selection is that it is not practical for large volumes of data.<br />
<br />
Other papers that do not use feature selection arbitrarily picks ECG leads for classification and does not provide rationale.<br />
<br />
<br />
==Betty==<br />
<br />
Result: <br />
<br />
1. Quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii<br />
<br />
2. Quantification of accuracies for pairs of top 5 highest individual channels with 20-fold cross-validation, resulting highest pairs accuracies to fed into a the neural network: lead v6 and lead vz <br />
<br />
3. Use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.<br />
<br />
4. Discussing 2 factors effecting model performance evaluation:<br />
<br />
1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion <br />
<br />
2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split <br />
<br />
5. Comparing with other models, the model in this article has the highest accuracy, specificity, and precision<br />
<br />
6. Further using 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz as record-wise split <br />
<br />
1） Discuss patient-wise split might result lower accuracy evaluation, however, it still maintain an average of 97.83%<br />
<br />
== Marty ==<br />
<br />
3.1: Data curation<br />
<br />
Dataset: 549 ECG records total<br />
<br />
290 unique patients<br />
<br />
Each ECG record has a mean length of over 100s<br />
<br />
3.2: ANN model<br />
<br />
ConvNetQuake model + 1D batch normalization + Label-smoothing<br />
<br />
Model (PyTorch):<br />
<br />
- Input layer: 10-second long ECG signal<br />
<br />
- Hidden layers: 8 * (1D convolution layer, Activation function: RELU, 1D batch normalization layer)<br />
<br />
- Output layer: 1280 dimensions -> 1 dimension, Activation function: Sigmoid<br />
<br />
Batch size = 10<br />
<br />
Learning rate = 10^-4<br />
<br />
Optimizer = ADAM<br />
<br />
80-10-10: Train-Validation-Test</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=42821Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-10-26T00:18:24Z<p>A29mukhe: </p>
<hr />
<div><br />
== Maggie ==<br />
<br />
Introduction<br />
<br />
Problem: To detect risk of heart disease from ECG signals.<br />
<br />
Importance: To provide a correct diagnosis so that proper health care can be given to patients.<br />
<br />
Area: Deep learning – scientific machine learning<br />
<br />
o Concerned about design, training, and use of ML algorithm in an optimal matter towards a certain problem.<br />
<br />
Deep learning model benefits:<br />
<br />
o Uses multiple GPUs to construct complicated NN that can be trained using TB-sized datasets, which are robust against noise.<br />
<br />
o After training is complete, it takes little computational power to conduct statistical inference.<br />
<br />
Question that need to be answered:<br />
<br />
o What AI architecture and datasets can provide evidence of symptoms, ECG changes, imaging evidence that shows loss of viable myocardium and wall motion abnormality?<br />
<br />
Purpose of the article:<br />
<br />
o To provide a detailed analysis on the contribution of each ECG lead in identifying heart disease in the model. This is because the selection of data in previous studies regarding heart disease made data selection arbitrary in identifying heart conditions.<br />
<br />
o To show the use of using multiple data channels of information to enhance prediction accuracy in deep learning, i.e. processing the top three ECG leads simultaneously in the neural network.<br />
<br />
o Show that feature engineering is not necessary in the training, validation, or testing process for ECG data in neural networks.<br />
<br />
Related work<br />
<br />
Database used: PTB database<br />
<br />
1. CNN Network<br />
<br />
a. Used both noisy and denoised ECGs without feature engineering.<br />
<br />
2. Used artificial neural network, probabilistic neural network, KNN, multi-layer perceptron, and Naïve Bayes Classification<br />
<br />
a. Extracted two features: T-wave integral and total integral to identify heart disease.<br />
<br />
3. Developed two different ANN: RBF and MLP<br />
<br />
4. Supervised learning techniques have limited success to the problem and used multiple instance learning.<br />
<br />
a. Demonstrate proposed algorithm LTMIL surpasses supervised approaches.<br />
<br />
5. Create new feature by approximating ECG signal using a 20th order polynomial, which achieved 94.4% accuracy.<br />
<br />
6. Stationary wavelet transforms to decompose ECG into sub-bands.<br />
<br />
a. SVM and KNN used to classify.<br />
<br />
b. Features used: sample entropy, normalized sub-bands, log energy entropy, median slope from sub-bands.<br />
<br />
7. Transfer learning – used deep CNN model for arrhythmia and developed it to detect heart disease.<br />
<br />
8. Simple adaptive threshold (SAT)<br />
<br />
a. Multiresolution approach, adaptive thresholding used to extract features: depth of Q peak and elevation in ST segment<br />
<br />
9. Subject-oriented approach using CNN to take in leads II, III, AVF<br />
<br />
10. Model ECG using 2nd order ODE and feed the best-fitting coefficients of the ECG signal into a SVM.<br />
<br />
11. Multi-channel CNN (16 layers) with long-short memory units<br />
<br />
12. Deep CNN that takes 3 seconds at a time of lead II as input<br />
<br />
Most use feature extraction/selection from the raw ECG data before training.<br />
<br />
o Problem with feature selection is that it is not practical for large volumes of data.<br />
<br />
Other papers that do not use feature selection arbitrarily picks ECG leads for classification and does not provide rationale.<br />
<br />
<br />
==Betty==<br />
<br />
Result: <br />
<br />
1. Quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii<br />
<br />
2. Quantification of accuracies for pairs of top 5 highest individual channels with 20-fold cross-validation, resulting highest pairs accuracies to fed into a the neural network: lead v6 and lead vz <br />
<br />
3. Use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.<br />
<br />
4. Discussing 2 factors effecting model performance evaluation:<br />
<br />
1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion <br />
<br />
2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split <br />
<br />
5. Comparing with other models, the model in this article has the highest accuracy, specificity, and precision<br />
<br />
6. Further using 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz as record-wise split <br />
<br />
1） Discuss patient-wise split might result lower accuracy evaluation, however, it still maintain an average of 97.83%<br />
<br />
== Marty ==<br />
<br />
3.1: Data curation<br />
<br />
Dataset: 549 ECG records total<br />
290 unique patients <br />
Each ECG record has a mean length of over 100s<br />
<br />
3.2: ANN model<br />
<br />
ConvNetQuake model + 1D batch normalization + Label-smoothing<br />
<br />
Model (PyTorch):<br />
- Input layer: 10-second long ECG signal<br />
- Hidden layers: 8 * (1D convolution layer, Activation function: RELU, 1D batch normalization layer)<br />
- Output layer: 1280 dimensions -> 1 dimension, Activation function: Sigmoid<br />
<br />
Batch size = 10<br />
Learning rate = 10^-4 Optimizer = ADAM<br />
<br />
80-10-10: Train-Validation-Test</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=42819Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-10-26T00:17:35Z<p>A29mukhe: </p>
<hr />
<div><br />
== Maggie ==<br />
<br />
Introduction<br />
Problem: To detect risk of heart disease from ECG signals.<br />
Importance: To provide a correct diagnosis so that proper health care can be given to patients.<br />
Area: Deep learning – scientific machine learning<br />
o Concerned about design, training, and use of ML algorithm in an optimal matter towards a certain problem.<br />
Deep learning model benefits:<br />
o Uses multiple GPUs to construct complicated NN that can be trained using TB-sized datasets, which are robust against noise.<br />
o After training is complete, it takes little computational power to conduct statistical inference.<br />
Question that need to be answered:<br />
o What AI architecture and datasets can provide evidence of symptoms, ECG changes, imaging evidence that shows loss of viable myocardium and wall motion abnormality?<br />
Purpose of the article:<br />
o To provide a detailed analysis on the contribution of each ECG lead in identifying heart disease in the model. This is because the selection of data in previous studies regarding heart disease made data selection arbitrary in identifying heart conditions.<br />
o To show the use of using multiple data channels of information to enhance prediction accuracy in deep learning, i.e. processing the top three ECG leads simultaneously in the neural network.<br />
o Show that feature engineering is not necessary in the training, validation, or testing process for ECG data in neural networks.<br />
Related work<br />
Database used: PTB database<br />
1. CNN Network<br />
a. Used both noisy and denoised ECGs without feature engineering.<br />
2. Used artificial neural network, probabilistic neural network, KNN, multi-layer perceptron, and Naïve Bayes Classification<br />
a. Extracted two features: T-wave integral and total integral to identify heart disease.<br />
3. Developed two different ANN: RBF and MLP<br />
4. Supervised learning techniques have limited success to the problem and used multiple instance learning.<br />
a. Demonstrate proposed algorithm LTMIL surpasses supervised approaches.<br />
5. Create new feature by approximating ECG signal using a 20th order polynomial, which achieved 94.4% accuracy.<br />
6. Stationary wavelet transforms to decompose ECG into sub-bands.<br />
a. SVM and KNN used to classify.<br />
b. Features used: sample entropy, normalized sub-bands, log energy entropy, median slope from sub-bands.<br />
7. Transfer learning – used deep CNN model for arrhythmia and developed it to detect heart disease.<br />
8. Simple adaptive threshold (SAT)<br />
a. Multiresolution approach, adaptive thresholding used to extract features: depth of Q peak and elevation in ST segment<br />
9. Subject-oriented approach using CNN to take in leads II, III, AVF<br />
10. Model ECG using 2nd order ODE and feed the best-fitting coefficients of the ECG signal into a SVM.<br />
11. Multi-channel CNN (16 layers) with long-short memory units<br />
12. Deep CNN that takes 3 seconds at a time of lead II as input<br />
Most use feature extraction/selection from the raw ECG data before training.<br />
o Problem with feature selection is that it is not practical for large volumes of data.<br />
Other papers that do not use feature selection arbitrarily picks ECG leads for classification and does not provide rationale.<br />
<br />
==Betty==<br />
<br />
Result: <br />
<br />
1. Quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii<br />
<br />
2. Quantification of accuracies for pairs of top 5 highest individual channels with 20-fold cross-validation, resulting highest pairs accuracies to fed into a the neural network: lead v6 and lead vz <br />
<br />
3. Use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.<br />
<br />
4. Discussing 2 factors effecting model performance evaluation:<br />
<br />
1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion <br />
<br />
2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split <br />
<br />
5. Comparing with other models, the model in this article has the highest accuracy, specificity, and precision<br />
<br />
6. Further using 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz as record-wise split <br />
<br />
1） Discuss patient-wise split might result lower accuracy evaluation, however, it still maintain an average of 97.83%<br />
<br />
== Marty ==<br />
<br />
3.1: Data curation<br />
<br />
Dataset: 549 ECG records total 290 unique patients Each ECG record has a mean length of over 100s<br />
<br />
3.2: ANN model<br />
<br />
ConvNetQuake model + 1D batch normalization + Label-smoothing<br />
<br />
Model (PyTorch): - Input layer: 10-second long ECG signal - Hidden layers: 8 * (1D convolution layer, Activation function: RELU, 1D batch normalization layer) - Output layer: 1280 dimensions -> 1 dimension, Activation function: Sigmoid<br />
<br />
Batch size = 10 Learning rate = 10^-4 Optimizer = ADAM<br />
<br />
80-10-10: Train-Validation-Test</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=42816Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-10-26T00:15:01Z<p>A29mukhe: Created page with "e"</p>
<hr />
<div>e</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:A29mukhe&diff=42811User:A29mukhe2020-10-26T00:11:18Z<p>A29mukhe: </p>
<hr />
<div><br />
<br />
Result: <br />
1. Quantification of accuracies for single channels with 20-fold cross-validation, resulting highest individual accuracies: v5, v6, vx, vz, and ii<br />
<br />
2. Quantification of accuracies for pairs of top 5 highest individual channels with 20-fold cross-validation, resulting highest pairs accuracies to fed into a the neural network: lead v6 and lead vz <br />
<br />
3. Use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.<br />
<br />
4. Discussing 2 factors effecting model performance evaluation:<br />
<br />
1） Random train-val-test split might have effects of the performance of the model, but it can be improved by access with a larger data set and further discussion <br />
<br />
2） random initialization of the weights of neural network shows little effects on the performance of the model performance evaluation, because of showing a high average results with a fixed train-val-test split <br />
<br />
5. Comparing with other models, the model in this article has the highest accuracy, specificity, and precision<br />
<br />
6. Further using 290 fold patient-wise split, resulting the same highest accuracy of the pair v6 and vz as record-wise split <br />
<br />
1） Discuss patient-wise split might result lower accuracy evaluation, however, it still maintain an average of 97.83%<br />
<br />
<br />
<br />
== Marty ==<br />
<br />
3.1: Data curation<br />
<br />
Dataset: 549 ECG records total<br />
290 unique patients<br />
Each ECG record has a mean length of over 100s<br />
<br />
3.2: ANN model<br />
<br />
ConvNetQuake model + 1D batch normalization + Label-smoothing<br />
<br />
Model (PyTorch):<br />
- Input layer: 10-second long ECG signal<br />
- Hidden layers: 8 * (1D convolution layer, Activation function: RELU, 1D batch normalization layer)<br />
- Output layer: 1280 dimensions -> 1 dimension, Activation function: Sigmoid<br />
<br />
Batch size = 10<br />
Learning rate = 10^-4<br />
Optimizer = ADAM<br />
<br />
80-10-10: Train-Validation-Test</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:A29mukhe&diff=42807User:A29mukhe2020-10-26T00:07:23Z<p>A29mukhe: Created page with "e"</p>
<hr />
<div>e</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_441/841_CM_763-Proposal&diff=42643F21-STAT 441/841 CM 763-Proposal2020-10-08T22:08:34Z<p>A29mukhe: </p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1 Group members:'''<br />
<br />
Song, Quinn<br />
<br />
Loh, William<br />
<br />
Bai, Junyue<br />
<br />
Choi, Phoebe<br />
<br />
'''Title:''' APTOS 2019 Blindness Detection<br />
<br />
'''Description:'''<br />
<br />
Our team chose the APTOS 2019 Blindness Detection Challenge from Kaggle. The goal of this challenge is to build a machine learning model that detects diabetic retinopathy by screening retina images.<br />
<br />
Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-aged adults. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina). In rural areas where medical screening is difficult to conduct, it is challenging to detect the disease efficiently. Aravind Eye Hospital hopes to utilize machine learning techniques to gain the ability to automatically screen images for disease and provide information on how severe the condition may be.<br />
<br />
Our team plans to solve this problem by applying our knowledge in image processing and classification.<br />
<br />
<br />
----<br />
<br />
'''Project # 2 Group members:'''<br />
<br />
Li, Dylan<br />
<br />
Li, Mingdao<br />
<br />
Lu, Leonie<br />
<br />
Sharman,Bharat<br />
<br />
'''Title:''' Risk prediction in life insurance industry using supervised learning algorithms<br />
<br />
'''Description:'''<br />
<br />
In this project, we aim to replicate and possibly improve upon the work of Jayabalan et al. in their paper “Risk prediction in life insurance industry using supervised learning algorithms”. We will be using the Prudential Life Insurance Data Set that the authors have used and have shared with us. We will be pre-processing the data to replace missing values, using feature selection using CFS and feature reduction using PCA use this processed data to perform Classification via four algorithms – Neural Networks, Random Tree, REPTree and Multiple Linear Regression. We will compare the performance of these Algorithms using MAE and RMSE metrics and come up with visualizations that can explain the results easily even to a non-quantitative audience. <br />
<br />
Our goal behind this project is to learn applying the algorithms that we learned in our class to an industry dataset and come up with results that we can aid better, data-driven decision making.<br />
<br />
----<br />
<br />
'''Project # 3 Group members:'''<br />
<br />
Parco, Russel<br />
<br />
Sun, Scholar<br />
<br />
Yao, Jacky<br />
<br />
Zhang, Daniel<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Lyft Motion Prediction for Autonomous Vehicles Kaggle competition. The aim of this competition is to build a model which given a set of objects on the road (pedestrians, other cars, etc), predict the future movement of these objects.<br />
<br />
Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Our aim is to apply classification techniques learned in class to optimally predict how these objects move.<br />
<br />
----<br />
<br />
'''Project # 4 Group members:'''<br />
<br />
Chow, Jonathan<br />
<br />
Dharani, Nyle<br />
<br />
Nasirov, Ildar<br />
<br />
'''Title:''' Classification with Abstinence<br />
<br />
'''Description:''' <br />
<br />
We seek to implement the algorithm described in [https://papers.nips.cc/paper/9247-deep-gamblers-learning-to-abstain-with-portfolio-theory.pdf Deep Gamblers: Learning to Abstain with Portfolio Theory]. The paper describes augmenting classification problems to include the option of abstaining from making a prediction when confidence is low.<br />
<br />
Medical imaging diagnostics is a field in which classification could assist professionals and improve life expectancy for patients through increased accuracy. However, there are also severe consequences to incorrect predictions. As such, we also hope to apply the algorithm implemented to the classification of medical images, specifically instances of normal and pneumonia [https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia? chest x-rays]. <br />
<br />
----<br />
<br />
'''Project # 5 Group members:'''<br />
<br />
Jones, Hayden<br />
<br />
Leung, Michael<br />
<br />
Haque, Bushra<br />
<br />
Mustatea, Cristian<br />
<br />
'''Title:''' Combine Convolution with Recurrent Networks for Text Classification<br />
<br />
'''Description:''' <br />
<br />
Our team chose to reproduce the paper [https://arxiv.org/pdf/2006.15795.pdf Combine Convolution with Recurrent Networks for Text Classification] on Arxiv. The goal of this paper is to combine CNN and RNN architectures in a way that more flexibly combines the output of both architectures other than simple concatenation through the use of a “neural tensor layer” for the purpose of improving at the task of text classification. In particular, the paper claims that their novel architecture excels at the following types of text classification: sentiment analysis, news categorization, and topical classification. Our team plans to recreate this paper by working in pairs of 2, one pair to implement the CNN pipeline and the other pair to implement the RNN pipeline. We will be working with Tensorflow 2, Google Collab, and reproducing the paper’s experimental results with training on the same 6 publicly available datasets found in the paper.<br />
<br />
----<br />
<br />
'''Project # 6 Group members:'''<br />
<br />
Chin, Ruixian<br />
<br />
Ong, Jason<br />
<br />
Chiew, Wen Cheen<br />
<br />
Tan, Yan Kai<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team chose to participate in a Kaggle research challenge "Mechanisms of Action (MoA) Prediction". This competition is a project within the Broad Institute of MIT and Harvard, the Laboratory for Innovation Science at Harvard (LISH), and the NIH Common Funds Library of Integrated Network-Based Cellular Signatures (LINCS), present this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
----<br />
<br />
'''Project # 7 Group members:'''<br />
<br />
Ren, Haotian <br />
<br />
Cheung, Ian Long Yat<br />
<br />
Hussain, Swaleh <br />
<br />
Zahid, Bin, Haris <br />
<br />
'''Title:''' Transaction Fraud Detection <br />
<br />
'''Description:''' <br />
<br />
Protecting people from fraudulent transactions is an important topic for all banks and internet security companies. This Kaggle project is based on the dataset from IEEE Computational Intelligence Society (IEEE-CIS). Our objective is to build a more efficient model in order to recognize each fraud transaction with a higher accuracy and higher speed.<br />
----<br />
<br />
'''Project # 8 Group members:'''<br />
<br />
ZiJie, Jiang<br />
<br />
Yawen, Wang<br />
<br />
DanMeng, Cui<br />
<br />
MingKang, Jiang<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles <br />
<br />
'''Description:'''<br />
<br />
Our team chose to participate in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". We will apply our science skills to build motion prediction models for self-driving vehicles. The model will be able to predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians. The goal of this competition is to predict the trajectories of other traffic participants.<br />
<br />
----------------------------------------------------------------------<br />
<br />
<br />
'''Project # 9 Group members:'''<br />
<br />
Banno, Dion <br />
<br />
Battista, Joseph<br />
<br />
Kahn, Solomon <br />
<br />
'''Title:''' Increasing Spotify user engagement through predictive personalization<br />
<br />
'''Description:''' <br />
<br />
Our project is an application of classification to the domain of predictive personalization. The goal of the project is to increase Spotify user engagement through data-driven methods. Given a set of users’ demographic data, listening preferences and behaviour, our goal is to build a recommendation system that suggests new songs to users. From a potential pool of songs to suggest, the final song recommendations will be driven by a classification algorithm that measures a given user’s propensity to like a song. We plan on leveraging the Spotify Web API to gather data about songs and collecting user data from consenting peers.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 10 Group members:'''<br />
<br />
Qing, Guo <br />
<br />
Wang, Yuanxin<br />
<br />
James, Ni<br />
<br />
Xueguang, Ma<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Mechanisms of Action (MoA) Prediction Kaggle competition. This is a challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
Our team plan to develop an algorithm to predict a compound’s MoA given its cellular signature and our goal is to learn various algorithms taught in this course.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 11 Group members:'''<br />
<br />
Yang, Jiwon <br />
<br />
Mahdi, Anas<br />
<br />
Thibault, Will<br />
<br />
Lau, Jan<br />
<br />
'''Title:''' Application of classification in human fatigue analysis<br />
<br />
'''Description:''' <br />
<br />
The goal of this project is to classify different levels of fatigue based on motion capture (Vicon) and force plates data. First, we plan to obtain data from 4 to 6 participants performing squats or squats with weights and rate them on a fatigue scale, with each participant doing at least 50 to 100 reps. We will collect data with EMG, IMU, force plates, and Vicon. When the participants are squatting, we will ask them about their fatigue level, and compare their feedback against the fatigue level recorded on EMG. The fatigue level will be on a scale of 1 to 10 (1 being not fatigued at all and 10 being cannot continue anymore). Once data is collected, we will classify the motion capture and force plates data into the different levels of fatigue.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 12 Group members:'''<br />
<br />
Xiaolan Xu, <br />
<br />
Robin Wen, <br />
<br />
Yue Weng, <br />
<br />
Beizhen Chang<br />
<br />
'''Title:''' Identification (Classification) of Submillimetre Galaxies Based on Multiwavelength Data in Astronomy<br />
<br />
'''Description:''' <br />
<br />
Identifying the counterparts of submillimetre galaxies (SMGs) in multiwavelength images is important to the study of galaxy evolution in astronomy. However, obtaining a statistically significant sample of robust associations is very challenging because of the poor angular resolution of single-dish submm facilities, that is we can not tell which galalxy is actually responsible for the submillimeter emission from a group of possible candidates due to the poor resolution. Recently, a set of labelled dataset is obtained from ALMA, a milliemetre/submilliemetre telescope array with the sufficient resolution to pin down the exact source of submillimeter emssion. However, applying such array to large fraction of skies are not feasible, so it is of practical interest to develop algorithm to identify submillimetre galaxies (SMGs) based on the other available data. With this newly labelled dataset from ALMA, it is possible to test and develop different new alrgorithms and apply them on unlabelled data to detect submillimetre galaxies.<br />
<br />
In our work, we primarily build on the works of Liu et al.(https://arxiv.org/abs/1901.09594), which tested a set of standard classification algorithms to the dataset. We aim to first reproduce their work and test other classification algorithms with a more stastics centered perspective. Next, we hope to possibly extend their works from one or some of the following directions: (1)Incorporating some other relevant features to augment the dimensions of the available dataset for better classification rate. (2)Taking the measurement error into the classifcation algorithms, possibly from a Bayesian approach. (All features in astronomy datasets come from actual physical measurements, which come with an error bar. However, it is not clear how to incoporate this error into the classification task.) (3)The possibility of combining some tradtional astronomy approaches with algorithms from ML.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 13 Group members:'''<br />
<br />
<br />
Zihui (Betty) Qin,<br />
<br />
Wenqi (Maggie) Zhao,<br />
<br />
Muyuan Yang,<br />
<br />
Amartya (Marty) Mukherjee,<br />
<br />
'''Title:''' Insider Trading Roles Classification Prediction on United States conventional stock or non-derivative transaction<br />
<br />
'''Description:'''<br />
<br />
Background (why we were interested in classifying based on insiders): <br />
The United States is one of the most frequently traded financial markets in the world. The dataset captures all insider activities as reported on SEC (U.S. Securities and Exchange Commission) forms 3, 4, 5, and 144. We believe that using variables (such as transaction date, security type, and transaction amount), we could predict the roles code for a new transaction. The reason for the chosen prediction is that the role of the insider gives investors signals of potential internal activities and private information. This is crucial for investors to detect important market signals from those insider trading activities, such that they could benefit from the market. <br />
<br />
Goal: To classify the role of an insider in a company based on the data of their trades.</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=42642stat441F212020-10-08T21:44:52Z<p>A29mukhe: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="250pt"|Name <br />
|width="15pt"|Paper number <br />
|width="700pt"|Title<br />
|width="15pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||<br />
|-<br />
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf] paper || ||<br />
|-<br />
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks] || ||<br />
|-<br />
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| || || ||<br />
|-<br />
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| || || ||<br />
|-<br />
|Week of Nov 16 || || 7|| || || ||<br />
|-<br />
|Week of Nov 16 || || 8|| || || ||<br />
|-<br />
|Week of Nov 16 || || 9|| || || ||<br />
|-<br />
|Week of Nov 16 || || 10|| || || ||<br />
|-<br />
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Netorks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || || 13|| || || ||<br />
|-<br />
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| || || ||<br />
|-<br />
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| Graph Structure of Neural Networks || [https://proceedings.icml.cc/paper/2020/file/757b505cfd34c64c85ca5b5690ee5293-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai, Leyan Cheng || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr] || ||<br />
|-<br />
|Week of Nov 30 || || 21|| || || ||<br />
|-<br />
|Week of Nov 30 || || 22|| || || ||<br />
|-<br />
|Week of Nov 30 || || 23|| || || ||<br />
|-<br />
|Week of Nov 30 || || 24|| || || ||<br />
|-<br />
|Week of Nov 30 || Anas Mahdi Will Thibault Jan Lau Jiwon Yang || 25|| Loss Function Search for Face Recognition<br />
|| [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf] paper || ||<br />
|-<br />
|Week of Nov 30 ||Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee || 26|| Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms || [https://arxiv.org/pdf/1912.07618.pdf?fbclid=IwAR0RwATSn4CiT3qD9LuywYAbJVw8YB3nbex8Kl19OCExIa4jzWaUut3oVB0 Paper] || ||<br />
|-<br />
|Week of Nov 30 || || 27|| || || ||<br />
|-<br />
|Week of Nov 30 || Yawen Wang, DanMeng Cui, ZiJie Jiang, MingKang Jiang || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Bertrand Sodjahin, Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classifcation systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || ||<br />
|-</div>A29mukhehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=42641stat441F212020-10-08T21:43:55Z<p>A29mukhe: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="250pt"|Name <br />
|width="15pt"|Paper number <br />
|width="700pt"|Title<br />
|width="15pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||<br />
|-<br />
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf] paper || ||<br />
|-<br />
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks] || ||<br />
|-<br />
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| || || ||<br />
|-<br />
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| || || ||<br />
|-<br />
|Week of Nov 16 || || 7|| || || ||<br />
|-<br />
|Week of Nov 16 || || 8|| || || ||<br />
|-<br />
|Week of Nov 16 || || 9|| || || ||<br />
|-<br />
|Week of Nov 16 || || 10|| || || ||<br />
|-<br />
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Netorks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || || 13|| || || ||<br />
|-<br />
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| || || ||<br />
|-<br />
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| Graph Structure of Neural Networks || [https://proceedings.icml.cc/paper/2020/file/757b505cfd34c64c85ca5b5690ee5293-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai, Leyan Cheng || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr] || ||<br />
|-<br />
|Week of Nov 30 || || 21|| || || ||<br />
|-<br />
|Week of Nov 30 || || 22|| || || ||<br />
|-<br />
|Week of Nov 30 || || 23|| || || ||<br />
|-<br />
|Week of Nov 30 || || 24|| || || ||<br />
|-<br />
|Week of Nov 30 || Anas Mahdi Will Thibault Jan Lau Jiwon Yang || 25|| Loss Function Search for Face Recognition<br />
|| [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf] paper || ||<br />
|-<br />
|Week of Nov 30 ||Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Zhao, Amartya (Marty) Mukherjee || 26|| [https://arxiv.org/pdf/1912.07618.pdf?fbclid=IwAR0RwATSn4CiT3qD9LuywYAbJVw8YB3nbex8Kl19OCExIa4jzWaUut3oVB0 Paper] || || ||<br />
|-<br />
|Week of Nov 30 || || 27|| || || ||<br />
|-<br />
|Week of Nov 30 || Yawen Wang, DanMeng Cui, ZiJie Jiang, MingKang Jiang || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Bertrand Sodjahin, Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classifcation systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || ||<br />
|-</div>A29mukhe