statwiki - User contributions [US]

Mask RCNN

2020-12-03T20:14:44Z

D5cui: /* Critiques */

== Presented by ==
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang

== Introduction ==
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision which is important when attempting to identify different objects within the same image.
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection.

== Visual Perception Tasks ==

Figure 1 shows a visual representation of different types of visual perception tasks:

- Image Classification: Predict a set of labels to characterize the contents of an input image

- Object Detection: Build on image classification but localize each object in an image

- Semantic Segmentation: Associate every pixel in an input image with a class label

- Instance Segmentation: Associate every pixel in an input image to a specific object

[[File:instance segmentation.png | center]]
<div align="center">Figure 1: Visual Perception tasks</div>

Mask RCNN is a deep neural network architecture for Instance Segmentation.

== Related Work ==
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.

ROI Pooling: The main use of ROI (Region of Interest) Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section.

Faster R-CNN: Faster R-CNN consists of two stages: Region Proposal Network and ROI Pooling. Region Proposal Network proposes candidate object bounding boxes. ROI Pooling, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.

[[File:FasterRCNN.png | center]]
<div align="center">Figure 2: Faster RCNN architecture</div>

ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale. Other than FPN, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.

[[File:ResNetFPN.png | center]]
<div align="center">Figure 3: ResNetFPN architecture</div>

== Model Architecture ==
The structure of mask R-CNN is quite similar to the structure of faster R-CNN.
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying the stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI as <math>L=L_{cls}+L_{box}+L_{mask}</math>, where <math>L_{cls}</math>, <math>L_{box}</math>, <math>L_{mask}</math> represent the classification loss, bounding box loss and the average binary cross-entropy loss respectively.

The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, complete R-CNN pipeline stages involve 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stages 3 and 4 are adjusted to simplify the network procedures.

The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.
One thing worth noticing is that for other network systems, those masks across classes compete with each other, but in this particular case, with a
per-pixel sigmoid and a binary loss the masks across classes no longer compete, which makes this formula the key for good instance segmentation results.

Another important concept involved is called RoIAlign. This concept is useful in stage 2 where the RoIPool extracts
features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to the pixel-to-pixel correspondence. The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. Also, instead of quantization, the coordinates are computed using bilinear interpolation to guarantee spatial correspondence.

The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction.

Some implementation details should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.

== Results ==
[[File:ExpInstanceSeg.png | center]]
<div align="center">Figure 4: Instance Segmentation Experiments</div>

Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of art model

[[File:BoundingBoxExp.png | center]]
<div align="center">Figure 5: Bounding Box Detection Experiments</div>

Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.

== Ablation Experiments ==
[[File:BackboneExp.png | center]]
<div align="center">Figure 6: Backbone Architecture Experiments</div>

(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.

[[File:MultiVSInde.png | center]]
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div>

(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).

[[File: RoIAlign.png | center]]
<div align="center">Figure 8: RoIAlign Experiments 1</div>

(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers.

[[File: RoIAlignExp.png | center]]
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div>

(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.

[[File:MaskBranchExp.png | center]]
<div align="center">Figure 10: Mask Branch Experiments</div>

(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.

== Human Pose Estimation ==
Mask RCNN can be extended to human pose estimation.

The simple approach the paper presents is to model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types such as left shoulder, right elbow.

[[File:HumanPose.png | center]]
<div align="center">Figure 11: Keypoint Detection Results</div>

== Conclusion ==
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.

== Critiques ==
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.

It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.

The paper lacks the comparisons of different methods and Mask RNN on unlabeled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.

The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.

The Mask RCNN could be a candidate model to do short-term predictions on the physical behaviors of a person, which could be very useful at crime scenes.

An interesting application of Mask RCNN would be on face recognition from CCTVs. Flurry pictures of crowded people could be obtained from CCTV, so that mask RCNN can be applied to distinguish each person.

The main problem for CNN architectures like Mask RCNN is the running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live stream detections, where a faster running time would mean a better response to changes in frames. It would be beneficial to have a graphical representation of the Mask RCNN running times against single shot detector algorithms such as YOLOv3.

It is interesting to investigate a solution of embedding instance segmentation with semantic segmentation to improve time performance. Because in many situations, knowing the exact boundary of an object is not necessary.

It will be better if we can have more comparisons with other models. It will also be nice if we can have more details about why Mask RCNN can perform better, and how about the efficiency of it?
The authors mentioned that Mask R-CNN is a deep neural network architecture for Instance Segmentation. It's better to include more background information about this task. For example, challenges of this task (e.g. the model will need to take into account the overlapping of objects) and limitations of existing methods.

It would be interesting to see how a postprocessing step with conditional random fields (CRF) might improve (or not?) segmentation. It would also have been interesting to see the performance of the method with lighter backbones since the backbones used to have a very large inference time which makes them unsuitable for many applications.

An extension of the application of Mask RCNN in medical AI is to highlight areas of an MRI scan that correlate to certain behavioral/psychological patterns.

== References ==
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.

[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.

[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015

Mask RCNN

2020-12-03T20:12:55Z

D5cui: /* Model Architecture */

== Presented by ==
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang

== Introduction ==
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision which is important when attempting to identify different objects within the same image.
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection.

== Visual Perception Tasks ==

Figure 1 shows a visual representation of different types of visual perception tasks:

- Image Classification: Predict a set of labels to characterize the contents of an input image

- Object Detection: Build on image classification but localize each object in an image

- Semantic Segmentation: Associate every pixel in an input image with a class label

- Instance Segmentation: Associate every pixel in an input image to a specific object

[[File:instance segmentation.png | center]]
<div align="center">Figure 1: Visual Perception tasks</div>

Mask RCNN is a deep neural network architecture for Instance Segmentation.

== Related Work ==
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.

ROI Pooling: The main use of ROI (Region of Interest) Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section.

Faster R-CNN: Faster R-CNN consists of two stages: Region Proposal Network and ROI Pooling. Region Proposal Network proposes candidate object bounding boxes. ROI Pooling, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.

[[File:FasterRCNN.png | center]]
<div align="center">Figure 2: Faster RCNN architecture</div>

ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale. Other than FPN, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.

[[File:ResNetFPN.png | center]]
<div align="center">Figure 3: ResNetFPN architecture</div>

== Model Architecture ==
The structure of mask R-CNN is quite similar to the structure of faster R-CNN.
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying the stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI as <math>L=L_{cls}+L_{box}+L_{mask}</math>, where <math>L_{cls}</math>, <math>L_{box}</math>, <math>L_{mask}</math> represent the classification loss, bounding box loss and the average binary cross-entropy loss respectively.

The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, complete R-CNN pipeline stages involve 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stages 3 and 4 are adjusted to simplify the network procedures.

The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.
One thing worth noticing is that for other network systems, those masks across classes compete with each other, but in this particular case, with a
per-pixel sigmoid and a binary loss the masks across classes no longer compete, which makes this formula the key for good instance segmentation results.

Another important concept involved is called RoIAlign. This concept is useful in stage 2 where the RoIPool extracts
features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to the pixel-to-pixel correspondence. The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. Also, instead of quantization, the coordinates are computed using bilinear interpolation to guarantee spatial correspondence.

The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction.

Some implementation details should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.

== Results ==
[[File:ExpInstanceSeg.png | center]]
<div align="center">Figure 4: Instance Segmentation Experiments</div>

Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of art model

[[File:BoundingBoxExp.png | center]]
<div align="center">Figure 5: Bounding Box Detection Experiments</div>

Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.

== Ablation Experiments ==
[[File:BackboneExp.png | center]]
<div align="center">Figure 6: Backbone Architecture Experiments</div>

(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.

[[File:MultiVSInde.png | center]]
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div>

(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).

[[File: RoIAlign.png | center]]
<div align="center">Figure 8: RoIAlign Experiments 1</div>

(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers.

[[File: RoIAlignExp.png | center]]
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div>

(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.

[[File:MaskBranchExp.png | center]]
<div align="center">Figure 10: Mask Branch Experiments</div>

(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.

== Human Pose Estimation ==
Mask RCNN can be extended to human pose estimation.

The simple approach the paper presents is to model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types such as left shoulder, right elbow.

[[File:HumanPose.png | center]]
<div align="center">Figure 11: Keypoint Detection Results</div>

== Conclusion ==
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.

== Critiques ==
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.

It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.

The paper lacks the comparisons of different methods and Mask RNN on unlabeled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.

The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.

The Mask RCNN could be a candidate model to do short term prediction on physical behaviors of a person, which could be very useful at crime scenes.

An interesting application of Mask RCNN would be on face recognition from CCTVs. Flurry pictures of crowded people could be obtained from CCTV, so that mask RCNN can be applied to distinguish each person.

The main problems for CNN architectures like Mask RCNN is the running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live stream detections, where a faster running time would mean a better response to changes in frames. It would be beneficial to have a graphical representation of the Mask RCNN running times against single shot detector algorithms such as YOLOv3.

It is interesting to investigate a solution of embedding instance segmentation with semantic segmentation in order to improve time performance. Because in many situations, knowing the exact boundary of an object is not necessary.

It will be better if we can have more comparisons with other models. It will also be nice if we can have more details about why Mask RCNN can perform better, and how about the efficiency of it?

Authors mentioned that Mask R-CNN is a deep neural network architecture for Instance Segmentation. It's better to include more background information about this task. For example, challenges of this task (e.g. the model will need to takes into account the overlapping of objects) and limitations of existing methods.

It would be interesting to see how a postprocessing step with conditional random fields (CRF) might improve (or not?) segmentation. It would also have been interesting to see performance of the method with lighter backbones since the backbones used have a very large inference time which make them unsuitable for many applications.

An extension of the application of Mask RCNN in medical AI is to highlight areas of an MRI scan that correlate to certain behavioral/psychological patterns.

== References ==
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.

[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.

[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015

Research Papers Classification System

2020-12-03T20:10:05Z

D5cui: /* Conclusion */

= Presented by =
Jill Wang, Junyi (Jay) Yang, Yu Min (Chris) Wu, Chun Kit (Calvin) Li

= Introduction =
This paper introduces a paper classification system that utilizes the Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and K-means clustering. The most important technology the system used to process big data is the Hadoop Distributed File System (HDFS). The system can handle quantitatively complex research paper classification problems efficiently and accurately.

===General Framework===

The paper classification system classifies research papers based on the abstracts given that the core of most papers is presented in the abstracts.

<ol><li>Paper Crawling
<p>Collects abstracts from research papers published during a given period</p></li>
<li>Preprocessing
<p> <ol style="list-style-type:lower-alpha"><li>Removes stop words in the papers crawled, in which only nouns are extracted from the papers</li>
<li>generates a keyword dictionary, keeping only the top-N keywords with the highest frequencies</li> </ol>
</p></li>
<li>Topic Modelling
<p> Use the LDA to group the keywords into topics</p>
</li>
<li>Paper Length Calculation
<p> Calculates the total number of occurrences of words to prevent an unbalanced TF values caused by the various length of abstracts using the map-reduce algorithm</p>
</li>
<li>Word Frequency Calculation
<p> Calculates the Term Frequency (TF) values which represent the frequency of keywords in a research paper</p>
</li>
<li>Document Frequency Calculation
<p> Calculates the Document Frequency (DF) values which represents the frequency of keywords in a collection of research papers. The higher the DF value, the lower the importance of a keyword.</p>
</li>
<li>TF-IDF calculation
<p> Calculates the inverse of the DF which represents the importance of a keyword.</p>
</li>
<li>Paper Classification
<p> Classify papers by topics using the K-means clustering algorithm.</p>
</li>
</ol>

===Technologies===

The HDFS with a Hadoop cluster composed of one master node, one sub-node, and four data nodes is what is used to process the massive paper data. Hadoop-2.6.5 version in Java is what is used to perform the TF-IDF calculation. Spark MLlib is what is used to perform the LDA. The Scikit-learn library is what is used to perform the K-means clustering.

===HDFS===

Hadoop Distributed File System was used to process big data in this system. What Hadoop does is to break a big collection of data into different partitions and pass each partition to one individual processor. Each processor will only have information about the partition of data it has received.

'''In this summary, we are going to focus on introducing the main algorithms of what this system uses, namely LDA, TF-IDF, and K-Means.'''

=Data Preprocessing=
===Crawling of Abstract Data===

Under the assumption that audiences tend to first read the abstract of a paper to gain an overall understanding of the material, it is reasonable to assume the abstract section includes “core words” that can be used to effectively classify a paper's subject.

An abstract is crawled to have its stop words removed. Stop words are words that are usually ignored by search engines, such as “the”, “a”, and etc. Afterward, nouns are extracted, as a more condensed representation for efficient analysis.

This is managed on HDFS. The TF-IDF value of each paper is calculated through map-reduce.

===Managing Paper Data===

To construct an effective keyword dictionary using abstract data and keywords data in all of the crawled papers, the authors categorized keywords with similar meanings using a single representative keyword. The approach is called stemming, which is common in cleaning data. 1394 keyword categories are extracted, which is still too much to compute. Hence, only the top 30 keyword categories are used.

<div align="center">[[File:table_1_kswf.JPG|700px]]</div>

=Topic Modeling Using LDA=

Latent Dirichlet allocation (LDA) is a generative probabilistic model that views documents as random mixtures over latent topics. Each topic is a distribution over words, and the goal is to extract these topics from documents.

LDA estimates the topic-word distribution <math>P\left(t | z\right)</math> (probability of word "z" having topic "t") and the document-topic distribution <math>P\left(z | d\right)</math> (probability of finding word "z" within a given document "d") using Dirichlet priors for the distributions with a fixed number of topics. For each document, obtain a feature vector:

\[F = \left( P\left(z_1 | d\right), P\left(z_2 | d\right), \cdots, P\left(z_k | d\right) \right)\]

In the paper, authors extract topics from preprocessed paper to generate three kinds of topic sets, each with 10, 20, and 30 topics respectively. The following is a table of the 10 topic sets of highest frequency keywords.

<div align="center">[[File:table_2_tswtebls.JPG|700px]]</div>

===LDA Intuition===

LDA uses the Dirichlet priors of the Dirichlet distribution, which allows the algorithm to model a probability distribution ''over prior probability distributions of words and topics''. The following picture illustrates 2-simplex Dirichlet distributions with different alpha values, one for each corner of the triangles.

<div align="center">[[File:dirichlet_dist.png|700px]]</div>

Simplex is a generalization of the notion of a triangle. In Dirichlet distribution, each parameter will be represented by a corner in simplex, so adding additional parameters implies increasing the dimensions of simplex. As illustrated, when alphas are smaller than 1 the distribution is dense at the corners. When the alphas are greater than 1 the distribution is dense at the centers.

The following illustration shows an example LDA with 3 topics, 4 words and 7 documents.

<div align="center">[[File:LDA_example.png|800px]]</div>

In the left diagram, there are three topics, hence it is a 2-simplex. In the right diagram there are four words, hence it is a 3-simplex. LDA essentially adjusts parameters in Dirichlet distributions and multinomial distributions (represented by the points), such that, in the left diagram, all the yellow points representing documents and, in the right diagram, all the points representing topics, are as close to a corner as possible. In other words, LDA finds topics for documents and also finds words for topics. At the end topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> are produced.

=Term Frequency Inverse Document Frequency (TF-IDF) Calculation=

TF-IDF is widely used to evaluate the importance of a set of words in the fields of information retrieval and text mining. It is a combination of term frequency (TF) and inverse document frequency (IDF). The idea behind this combination is
* It evaluates the importance of a word within a document, and
* It evaluates the importance of the word among the collection of all documents

The TF-IDF formula has the following form:

\[TF-IDF_{i,j} = TF_{i,j} \times IDF_{i}\]

where i stands for the <math>i^{th}</math> word and j stands for the <math>j^{th}</math> document.

===Term Frequency (TF)===

TF evaluates the percentage of a given word in a document. Thus, TF value indicates the importance of a word. The TF has a positive relation with the importance.

In this paper, we only calculate TF for words in the keyword dictionary obtained. For a given keyword i, <math>TF_{i,j}</math> is the number of times word i appears in document j divided by the total number of words in document j.

The formula for TF has the following form:

\[TF_{i,j} = \frac{n_{i,j} }{\sum_k n_{k,j} }\]

where i stands for the <math>i^{th}</math> word, j stands for the <math>j^{th}</math> document, <math>n_{i,j}</math> stands for the number of times words <math>t_i</math> appear in document <math>d_j</math> and <math>\sum_k n_{k,j} </math> stands for total number of occurence of words in document <math>d_j</math>.

Note that the denominator is the total number of words remaining in document j after crawling.

===Document Frequency (DF)===

DF evaluates the percentage of documents that contain a given word over the entire collection of documents. Thus, the higher DF value is, the less important the word is.

<math>DF_{i}</math> is the number of documents in the collection with word i divided by the total number of documents in the collection. The formula for DF has the following form:

\[DF_{i} = \frac{|d_k \in D: n_{i,k} > 0|}{|D|}\]

where <math>n_{i,k}</math> is the number of times word i appears in document k, |D| is the total number of documents in the collection.

Since DF and the importance of the word have an inverse relation, we use inverse document frequency (IDF) instead of DF.

===Inverse Document Frequency (IDF)===

In this paper, IDF is calculated in a log scale. Since we will receive a large number of documents, i.e, we will have a large |D|

The formula for IDF has the following form:

\[IDF_{i} = log\left(\frac{|D|}{|\{d_k \in D: n_{i,k} > 0\}|}\right)\]

As mentioned before, we will use HDFS. The actual formula applied is:

\[IDF_{i} = log\left(\frac{|D|+1}{|\{d_k \in D: n_{i,k} > 0\}|+1}\right)\]

The inverse document frequency gives a measure of how rare a certain term is in a given document corpus.

=Paper Classification Using K-means Clustering=

The K-means clustering is an unsupervised classification algorithm that groups similar data into the same class. It is an efficient and simple method that can work with different types of data attributes and is able to handle noise and outliers.
<br>

Given a set of <math>d</math> by <math>n</math> dataset <math>\mathbf{X} = \left[ \mathbf{x}_1 \cdots \mathbf{x}_n \right]</math>, the algorithm will assign each <math>\mathbf{x}_j</math> into <math>k</math> different clusters based on the characteristics of <math>\mathbf{x}_j</math> itself.
<br>

Moreover, when assigning data into a cluster, the algorithm will also try to minimise the distances between the data and the centre of the cluster which the data belongs to. That is, k-means clustering will minimize the sum of square error:

\begin{align*}
min \sum_{i=1}^{k} \sum_{j \in C_i} ||x_j - \mu_i||^2
\end{align*}

where
<ul>
<li><math>k</math>: the number of clusters</li>
<li><math>C_i</math>: the <math>i^th</math> cluster</li>
<li><math>x_j</math>: the <math>j^th</math> data in the <math>C_i</math></li>
<li><math>mu_i</math>: the centroid of <math>C_i</math></li>
<li><math>||x_j - \mu_i||^2</math>: the Euclidean distance between <math>x_j</math> and <math>\mu_i</math></li>
</ul>
<br>

Since the goal for this paper is to classify research papers and group papers with similar topics based on keywords, the paper uses the K-means clustering algorithm. The algorithm first computes the cluster centroid for each group of papers with a specific topic. Then, it will assign a paper into a cluster based on the Euclidean distance between the cluster centroid and the paper’s TF-IDF value.
<br>

However, different values of <math>k</math> (the number of clusters) will return different clustering results. Therefore, it is important to define the number of clusters before clustering. For example, in this paper, the authors choose to use the Elbow scheme to determine the value of <math>k</math>. The Elbow scheme is a somewhat subjective way of choosing an optimal <math>k</math> that involves plotting the average of the squared distances from the cluster centers of the respective clusters (distortion) as a function of <math>k</math> and choosing a <math>k</math> at which point the decrease in distortion is outweighed by the increase in complexity. Also, to measure the performance of clustering, the authors decide to use the Silhouette scheme. The results of clustering are validated if the Silhouette scheme returns a value greater than <math>0.5</math>.

=System Testing Results=

In this paper, the dataset has 3264 research papers from the Future Generation Computer System (FGCS) journal between 1984 and 2017. For constructing keyword dictionaries for each paper, the authors have introduced three methods as shown below:

<div align="center">[[File:table_3_tmtckd.JPG|700px]]</div>

Then, the authors use the Elbow scheme to define the number of clusters for each method with different numbers of keywords before running the K-means clustering algorithm. The results are shown below:

<div align="center">[[File:table_4_nocobes.JPG|700px]]</div>

According to Table 4, there is a positive correlation between the number of keywords and the number of clusters. In addition, method 3 combines the advantages for both method 1 and method 2; thus, method 3 requires the least clusters in total. On the other hand, the wrong keywords might be presented in papers; hence, it might not be possible to group papers with similar subjects correctly by using method 1 and so method 1 needs the most number of clusters in total.

Next, the Silhouette scheme had been used for measuring the performance for clustering. The average of the Silhouette values for each method with different numbers of keywords are shown below:

<div align="center">[[File:table_5_asv.JPG|700px]]</div>

Since the clustering is validated if the Silhouette’s value is greater than 0.5, for methods with 10 and 30 keywords, the K-means clustering algorithm produces good results.

To evaluate the accuracy of the classification system in this paper, the authors use the F-Score. The authors execute 5 times of experiment and use 500 randomly selected research papers for each trial. The following histogram shows the average value of F-Score for the three methods and different numbers of keywords:

<div align="center">[[File:fig_16_fsvotm.JPG|700px]]</div>

Note that “TFIDF” means method 1, “LDA” means method 2, and “TFIDF-LDA” means method 3. The number 10, 20, and 30 after each method is the number of keywords the method has used.
According to the histogram above, method 3 has the highest F-Score values than the other two methods with different numbers of keywords. Therefore, the classification system is most accurate when using method 3 as it combines the advantages for both method 1 and method 2.

=Conclusion=

This paper introduces a classification system that classifies research papers into different topics by using TF-IDF and LDA scheme with K-means clustering algorithm. The experimental results showed that the proposed system can classify the papers with similar subjects according to the keywords extracted from the abstracts of papers. This system allows users to search the papers they want quickly and with the most productivity.

Furthermore, this classification system might be also used in different types of texts (e.g. documents, tweets, etc.) instead of only classifying research papers.

=Critique=

In this paper, DF values are calculated within each partition. This results that for each partition, DF value for a given word will vary and may have an inconsistent result for different partition methods. As mentioned above, there might be a divide by zero problem since some partitions do not have documents containing a given word, but this can be solved by introducing a dummy document as the authors did. Another method that might be better at solving inconsistent results and the divide by zero problems is to have all partitions to communicate with their DF value. Then pass the merged DF value to all partitions to do the final IDF and TF-IDF value. Having all partitions to communicate with the DF value will guarantee a consistent DF value across all partitions and helps avoid a divide by zero problem as words in the keyword dictionary must appear in some documents in the whole collection.

This paper treated the words in the different parts of a document equivalently, it might perform better if it gives different weights to the same word in different parts. For example, if a word appears in the title of the document, it usually shows it's a main topic of this document so we can put more weight on it to categorize.

When discussing the potential processing advantages of this classification system for other types of text samples, has the effect of processing mixed samples (text and image or text and video) taken into consideration? IF not, in terms of text classification only, does it have an overwhelming advantage over traditional classification models?

The preprocessing should also include <math>n</math>-gram tokenization for topic modelling because some topics are inherently two words, such as machine learning where if it is seen separately, it implies different topics.

This system is very compute-intensive due to the large volumes of dictionaries that can be generated by processing large volumes of data. It would be nice to see how much data HDFS had to process and similarly how much time was saved by using Hadoop for data processing as opposed to centralized approach.

This system can be improved further in terms of computation times by utilizing other big data framework MapReduce, that can also use HDFS, by parallelizing their computation across multiple nodes for K-means clustering as discussed in (Jin, et al) [5].

It's not exactly clear what method 3 (TFIDF-LDA) is doing, how is it performing TF-IDF on the topics? Also it seems like the preprocessing step only keeps 10/20/30 top words? This seems like an extremely low number especially in comparison with the LDA which has 10/20/30 topics - what is the reason for so strongly limiting the number of words? It would also be interesting to see if both key words and topics are necessary - an ablation study showing the significance of both would be interesting.

It is better if the paper has an example with some topics on some research papers. Also it is better if we can visualize the distance between each research paper and the topic names

=References=

Blei DM, el. (2003). Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

Gil, JM, Kim, SW. (2019). Research paper classification systems based on TF-IDF and LDA schemes. ''Human-centric Computing and Information Sciences'', 9, 30. https://doi.org/10.1186/s13673-019-0192-7

Liu, S. (2019, January 11). Dirichlet distribution Motivating LDA. Retrieved November 2020, from https://towardsdatascience.com/dirichlet-distribution-a82ab942a879

Serrano, L. (Director). (2020, March 18). Latent Dirichlet Allocation (Part 1 of 2) [Video file]. Retrieved 2020, from https://www.youtube.com/watch?v=T05t-SqKArY

Jin, Cui, Yu. (2016). A New Parallelization Method for K-means. https://arxiv.org/ftp/arxiv/papers/1608/1608.06347.pdf

Speech2Face: Learning the Face Behind a Voice

2020-12-03T20:07:39Z

D5cui: /* Discussion and Critiques */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, separate speech from multiple concurrent sources, predict lip motion from speech, and even learn the emotion of the agents based on their voices. Aytar et al. [6] proposed a student-teacher training procedure in which a well established visual recognition model was used to transfer the knowledge obtained in the visual modality to the sound modality, using unlabeled videos.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. A generative adversarial network (GAN) model is one that uses a generator to produce seemingly possible data for training and a discriminator that identifies if the training data is fabricated by the generator or if it is real [7]. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
It seems to be a common trait among humans to imagine what some people look like when we hear their voices before we have seen what they look lke. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg|center]]

<div style="text-align:center;"> Figure 1. '''Speech2Face model and training pipeline''' </div>

The Speech2Face Model used to achieve the desired result consists of two parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database [5] is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG|center]]

<div style="text-align:center;"> Table 1: '''Voice encoder architecture''' </div>

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time and allows the model to be used for input speeches of varying lengths. Two fully connected layers at the end are used to return a 4096-dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large-scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair color, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and then inputted to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096-dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math>: <math> \mathbb{R}^{4096} \to \mathbb{R}^{2622}</math> and the first layer <math>f_{dec}</math> : <math> \mathbb{R}^{4096} \to \mathbb{R}^{1000}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}(a)\text{log}p_{(i)}(b)$$ $$p_{(i)}(a) = \frac{\text{exp}(a_i/T)}{\sum_j \text{exp}(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

<div style="text-align:center;"> Figure 2: '''Qualitative results on the AVSpeech test set''' </div>

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

<div style="text-align:center;"> Figure 3. '''Facial attribute evaluation''' </div>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

<div style="text-align:center;"> Table 2. '''Feature similarity''' </div>

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and the true facial feature vector from the face decoder were computed, and presented, above, in Table 2. A comparison of facial similarity was also done based on the length of audio input. From the table, it is evident that the 6-second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2F -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

<div style="text-align:center;"> Table 3. '''S2F -> Face retrieval performance''' </div>

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, measures the probability that the K closest images to the model output includes the correct image of the speaker's face. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.

'''Additional Observations'''

Ablation studies were carried out to test the effect of audio duration and batch normalization. It was found that the duration of input audio during the training stage had little effect on convergence speed (comparing 3 and 6-second speech segments), while in the test stage longer input speech yields improvement in reconstruction quality. With respect to batch normalization (BN), it was found that without BN reconstructed faces would converge to an average face, while the inclusion of BN led to results which contained much richer facial features.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

There is evidence that the results of the model may be heavily influenced by external factors:

1. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the model's prediction of ethnicity towards white. The bias in the results shows that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Figure (11) highlights this shortcoming: The same man heard speaking in either English or Chinese was predicted to have a "white" appearance or an "asian" appearance respectively.

2. The model was shown to infer different face features based on language. This puts into question how heavily the model depends on the spoken language. The paper mentioned the quality of face reconstruction may be affected by uncommon languages, where English is the most popular language on Youtube(training set). Testing a more controlled sample where all speech recording was of the same language may help address this concern to determine the model's reliance on spoken language.

3. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender, and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face and may result in a compounding effect on the miss-classification rate.

4. Figure 4.b shows the AVSpeech dataset statistics. However, it doesn't show the statistics about speakers' ethnicity and the language of the video. If we train the model with a more comprehensive dataset that includes enough Asian/Indian English speakers and native language speakers will this increase the accuracy?

5. One concern about the source of the training data, i.e. the Youtube videos, is that resolution varies a lot since the videos are randomly selected. That may be the reason why the proposed model performs badly on some certain features. For example, it is hard to tell the age when the resolution is bad because the wrinkles on the face are neglected.

6. The topic of this project is very interesting, but I highly doubt this model will be practical in real-world problems. Because there are many factors to affect a person's sound in a real-world environment. Sounds such as phone clock, TV, car horn and so on. These sounds will decrease the accuracy of the predicted result of the model.

7. A lot of information can be obtained from someone's voice, this can potentially be useful for detective work and crime scene investigation. In our world of increasing surveillance, public voice recording is quite common and we can reconstruct images of potential suspects based on their voice. In order for this to be achieved, the model has to be thoroughly trained and tested to avoid false positives as it could have a highly destructive outcome for a falsely convicted suspect.

8. This is a very interesting topic, and this summary has a good structure for readers. Since this model uses Youtube to train model, but I think one problem is that most of the YouTubers are adult, and many additional reasons make this dataset highly unbalanced. What is more, some people may have a baby voice, this also could affect the performance of the model. But overall, this is a meaningful topic, it might help police to locate the suspects. So it might be interesting to apply this to the police.

9. In addition, it seems very unlikely that any results coming from this model would ever be held in regard even remotely close to being admissible in court to identify a person of interest until the results are improved and the model can be shown to work in real-world applications. Otherwise, there seems to be very little use for such technology and it could have negative impacts on people if they were to be depicted in an unflattering way by the model based on their voice.

10. Using voice as a factor of constructing the face is a good idea, but it seems like the data they have will have lots of noise and bias. The voice of a video might not come from the person in the video. There are so many YouTubers adjusting their voices before uploading their video and it's really hard to know whether they adjust their voice. Also, most YouTubers are adults so the model cannot have enough training samples about teenagers and kids.

11. It would be interesting to see how the performance changes with different face encoding sizes (instead of just 4096-D) and also difference face models (encoder/decoders) to see if better performance can be achieved. Also given that the dataset used was unbalanced, was the dataset used to train the face model the same dataset? or was a different dataset used (the model was pretrained). This could affect the performance of the model as well.

12. The audio input is transformed into a spectrogram before being used for training. They use STFT with a Hann window of 25 mm, a hop length of 10 ms, and 512 FFT frequency bands. They cite this method from a paper that focuses on speech separation, not speech classification. So, it would be interesting to see if there is a better way to do STFT, possibly with different hyperparameters (eg. different windowing, different number of bands), or if another type of transform (eg. wavelet transform) would have better results.

13. A easy way to get somewhat balanced data is to duplicate the data that are fewer.

14. This problem is interesting but is hard to generalize. This algorithm didn't account for other genders and mixed-race. In addition, the face recognition software Face++ introduces bias which can carry forward to Speech2Face algorithm. Face recognition algorithms are known to have higher error rates classifying darker-skinned individuals. Thus, it'll be tough to apply it to real-life scenarios like identifying suspects.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.

[7] “Overview of GAN Structure | Generative Adversarial Networks,” ''Google Developers'', 24-May-2019. [Online]. Available: https://developers.google.com/machine-learning/gan/gan_structure. [Accessed: 02-Dec-2020].

Speech2Face: Learning the Face Behind a Voice

2020-12-03T20:07:15Z

D5cui: /* Discussion and Critiques */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, separate speech from multiple concurrent sources, predict lip motion from speech, and even learn the emotion of the agents based on their voices. Aytar et al. [6] proposed a student-teacher training procedure in which a well established visual recognition model was used to transfer the knowledge obtained in the visual modality to the sound modality, using unlabeled videos.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. A generative adversarial network (GAN) model is one that uses a generator to produce seemingly possible data for training and a discriminator that identifies if the training data is fabricated by the generator or if it is real [7]. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
It seems to be a common trait among humans to imagine what some people look like when we hear their voices before we have seen what they look lke. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg|center]]

<div style="text-align:center;"> Figure 1. '''Speech2Face model and training pipeline''' </div>

The Speech2Face Model used to achieve the desired result consists of two parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database [5] is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG|center]]

<div style="text-align:center;"> Table 1: '''Voice encoder architecture''' </div>

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time and allows the model to be used for input speeches of varying lengths. Two fully connected layers at the end are used to return a 4096-dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large-scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair color, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and then inputted to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096-dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math>: <math> \mathbb{R}^{4096} \to \mathbb{R}^{2622}</math> and the first layer <math>f_{dec}</math> : <math> \mathbb{R}^{4096} \to \mathbb{R}^{1000}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}(a)\text{log}p_{(i)}(b)$$ $$p_{(i)}(a) = \frac{\text{exp}(a_i/T)}{\sum_j \text{exp}(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

<div style="text-align:center;"> Figure 2: '''Qualitative results on the AVSpeech test set''' </div>

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

<div style="text-align:center;"> Figure 3. '''Facial attribute evaluation''' </div>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

<div style="text-align:center;"> Table 2. '''Feature similarity''' </div>

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and the true facial feature vector from the face decoder were computed, and presented, above, in Table 2. A comparison of facial similarity was also done based on the length of audio input. From the table, it is evident that the 6-second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2F -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

<div style="text-align:center;"> Table 3. '''S2F -> Face retrieval performance''' </div>

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, measures the probability that the K closest images to the model output includes the correct image of the speaker's face. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.

'''Additional Observations'''

Ablation studies were carried out to test the effect of audio duration and batch normalization. It was found that the duration of input audio during the training stage had little effect on convergence speed (comparing 3 and 6-second speech segments), while in the test stage longer input speech yields improvement in reconstruction quality. With respect to batch normalization (BN), it was found that without BN reconstructed faces would converge to an average face, while the inclusion of BN led to results which contained much richer facial features.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

There is evidence that the results of the model may be heavily influenced by external factors:

There is evidence that the results of the model may be heavily influenced by external factors:
1. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the model's prediction of ethnicity towards white. The bias in the results shows that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Figure (11) highlights this shortcoming: The same man heard speaking in either English or Chinese was predicted to have a "white" appearance or an "asian" appearance respectively.

2. The model was shown to infer different face features based on language. This puts into question how heavily the model depends on the spoken language. The paper mentioned the quality of face reconstruction may be affected by uncommon languages, where English is the most popular language on Youtube(training set). Testing a more controlled sample where all speech recording was of the same language may help address this concern to determine the model's reliance on spoken language.

3. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender, and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face and may result in a compounding effect on the miss-classification rate.

4. Figure 4.b shows the AVSpeech dataset statistics. However, it doesn't show the statistics about speakers' ethnicity and the language of the video. If we train the model with a more comprehensive dataset that includes enough Asian/Indian English speakers and native language speakers will this increase the accuracy?

5. One concern about the source of the training data, i.e. the Youtube videos, is that resolution varies a lot since the videos are randomly selected. That may be the reason why the proposed model performs badly on some certain features. For example, it is hard to tell the age when the resolution is bad because the wrinkles on the face are neglected.

6. The topic of this project is very interesting, but I highly doubt this model will be practical in real-world problems. Because there are many factors to affect a person's sound in a real-world environment. Sounds such as phone clock, TV, car horn and so on. These sounds will decrease the accuracy of the predicted result of the model.

7. A lot of information can be obtained from someone's voice, this can potentially be useful for detective work and crime scene investigation. In our world of increasing surveillance, public voice recording is quite common and we can reconstruct images of potential suspects based on their voice. In order for this to be achieved, the model has to be thoroughly trained and tested to avoid false positives as it could have a highly destructive outcome for a falsely convicted suspect.

8. This is a very interesting topic, and this summary has a good structure for readers. Since this model uses Youtube to train model, but I think one problem is that most of the YouTubers are adult, and many additional reasons make this dataset highly unbalanced. What is more, some people may have a baby voice, this also could affect the performance of the model. But overall, this is a meaningful topic, it might help police to locate the suspects. So it might be interesting to apply this to the police.

9. In addition, it seems very unlikely that any results coming from this model would ever be held in regard even remotely close to being admissible in court to identify a person of interest until the results are improved and the model can be shown to work in real-world applications. Otherwise, there seems to be very little use for such technology and it could have negative impacts on people if they were to be depicted in an unflattering way by the model based on their voice.

10. Using voice as a factor of constructing the face is a good idea, but it seems like the data they have will have lots of noise and bias. The voice of a video might not come from the person in the video. There are so many YouTubers adjusting their voices before uploading their video and it's really hard to know whether they adjust their voice. Also, most YouTubers are adults so the model cannot have enough training samples about teenagers and kids.

11. It would be interesting to see how the performance changes with different face encoding sizes (instead of just 4096-D) and also difference face models (encoder/decoders) to see if better performance can be achieved. Also given that the dataset used was unbalanced, was the dataset used to train the face model the same dataset? or was a different dataset used (the model was pretrained). This could affect the performance of the model as well.

12. The audio input is transformed into a spectrogram before being used for training. They use STFT with a Hann window of 25 mm, a hop length of 10 ms, and 512 FFT frequency bands. They cite this method from a paper that focuses on speech separation, not speech classification. So, it would be interesting to see if there is a better way to do STFT, possibly with different hyperparameters (eg. different windowing, different number of bands), or if another type of transform (eg. wavelet transform) would have better results.

13. A easy way to get somewhat balanced data is to duplicate the data that are fewer.

14. This problem is interesting but is hard to generalize. This algorithm didn't account for other genders and mixed-race. In addition, the face recognition software Face++ introduces bias which can carry forward to Speech2Face algorithm. Face recognition algorithms are known to have higher error rates classifying darker-skinned individuals. Thus, it'll be tough to apply it to real-life scenarios like identifying suspects.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.

[7] “Overview of GAN Structure | Generative Adversarial Networks,” ''Google Developers'', 24-May-2019. [Online]. Available: https://developers.google.com/machine-learning/gan/gan_structure. [Accessed: 02-Dec-2020].

Evaluating Machine Accuracy on ImageNet

2020-12-03T20:03:54Z

D5cui: /* Human Labeler Evaluation */

== Presented by ==
Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du

== Introduction ==
ImageNet is the most influential data set in machine learning with images and corresponding labels over 1000 classes. This paper intends to explore the causes for performance differences between human experts and machine learning models, more specifically, CNN, on ImageNet.

Firstly, some images may fall into multiple classes. As a result, it is possible to underestimate the performance if we map each image to strictly one label, which is what is being done in the top-1 metric. Therefore, we adopt both top-1 and top-5 metrics where the performances of models, unlike human labelers, are linearly correlated in both cases.

Secondly, in contrast to the uniform performance of models in classes, humans tend to achieve better performances on inanimate objects. Human labelers achieve similar overall accuracies as the models, which indicates spaces of improvements on specific classes for machines.

Lastly, the setup of drawing training and test sets from the same distribution may favour models over human labelers. That is, the accuracy of multi-class prediction from models drops when the testing set is drawn from a different distribution than the training set, ImageNetV2. But this shift in distribution does not cause a problem for human labelers.

== Experiment Setup ==
=== Overview ===
There are four main phases to the experiment, which are (i) initial multilabel annotation, (ii) human labeler training, (iii) human labeler evaluation, and (iv) final annotation overview. The five authors of the paper are the participants in the experiments.

A brief overview of the four phases is as follows:
[[File:Experiment Set Up.png |800px| center]]

=== Initial multi-label annotation ===
Three labelers A, B, and C provided multi-label annotations for a subset from the ImageNet validation set, and all images from the ImageNetV2 test sets. These experiences give A, B, and C extensive experience with the ImageNet dataset.

=== Human Labeler Training ===
All five labelers trained on labeling a subset of the remaining ImageNet images. "Training" the human labelers consisted of teaching the humans the distinctions between very similar classes in the training set. For example, there are 118 classes of "dog" within ImageNet and typical human participants will not have working knowledge of the names of each breed of dog seen even if they can recognize and distinguish that breed from others.

=== Human Labeler Evaluation ===
Class-balanced random samples, which contains 1,000 images from the 20,000 annotated images are generated from both the ImageNet validation set and ImageNetV2. Five participants labeled these images over 28 days.

=== Final annotation Review ===
All labelers reviewed the additional annotations generated in the human labeler evaluation phase.

== Multi-label annotations==
[[File:Categories Multilabel.png|800px|center]]
<div align="center">Figure 3</div>

===Top-1 accuracy===
With Top-1 accuracy being the standard accuracy measure used in classification studies, it measures the proportions of examples for which the predicted label matches the single target label. As many images often contain more than one object for classification, for example, Figure 3a contains a desk, laptop, keyboard, space bar, and more. With Figure 3b showing a centered prominent figure yet labeled otherwise (people vs picket fence), it can be seen how a single target label is inaccurate for such a task since identifying the main objects in the image does not suffice due to its overly stringent and punishes predictions that are the main image yet does not match its label.
===Top-5 accuracy===
With Top-5 considers a classification correct if the object label is in the top 5 predicted labels, it partially resolves the problem with Top-1 labeling yet it is still not ideal since it can trivialize class distinctions. For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations.
===Multi-label accuracy===
The paper then proposes that for every image, the image shall have a set of target labels and a prediction; if such prediction matches one of the labels, it will be considered as correct labeling. Due to the above-discussed limitations of Top-1 and Top-5 metrics, the paper claims it is necessary for rigorous accuracy evaluation on the dataset.

===Types of Multi-label annotations===
====Multiple objects or organisms====
For the images containing more than one object or organism that corresponds to ImageNet, the paper proposed to add an additional target label for each entity in the image. With the discussed image in Figure 3b, the class groom, bow tie, suit, gown, and hoopskirt are all present in the foreground which is then subsequently added to the set of labels.
====Synonym or subset relations====
For similar classes, the paper considers them as under the same bigger class, that is, for two similarly labeled images, classification is considered correct if the produced label matches either one of the labels. For instance, warthog, African elephant, and Indian element all have prominent tusks, they will be considered subclasses of the tusker, Figure 3c shows a modification of labels to contain tusker as a correct label.
====Unclear Image====
In certain cases such as Figure 3d, there is a distinctive difficulty to determine whether a label was correct due to ambiguities in the class hierarchy.
===Collecting multi-label annotations===
Participants reviewed all predictions made by the models on the dataset ImageNet and ImageNet-V2, the participants then categorized every unique prediction made by the models on the dataset into correct and incorrect labels in order to allow all images to have multiple correct labels to satisfy the above-listed method.
===The multi-label accuracy metric===
One prediction is only correct if and only if it was marked correct by the expert reviewers during the annotation stage. As discussed in the experiment setup section, after human labelers have completed labeling, a second annotation stage is conducted. In Figure 4, a comparison of Top-1, Top-5, and multi-label accuracies showed higher Top-1 and Top-5 accuracy corresponds with higher multi-label accuracy as expected. With multi-label accuracies measures consistently higher than Top-1 yet lower than Top-5 which shows a high correlation between the three metrics, the paper concludes that multi-label metrics measures a semantically more meaningful notion of accuracy compared to its counterparts.

== Human Accuracy Measurement Process ==
=== Bias Control ===
Since three participants participated in the initial round of annotation, they did not look at the data for six months, and two additional annotators are introduced in the final evaluation phase to ensure fairness of the experiment.

=== Human Labeler Training ===
The three main difficulties encountered during human labeler training are fine-grained distinctions, class unawareness, and insufficient training images. Thus, three training regimens are provided to address the problems listed above, respectively. First, labelers will be assigned extra training tasks with immediate feedbacks on similar classes. Second, labelers will be provided access to search for specific classes during labeling. Finally, the training set will contain a reasonable amount of images for each class.

=== Labeling Guide ===
A labeling guide is constructed to distill class analysis learned during training into discriminative traits that could be used as a reference during the final labeling evaluation.

=== Final Evaluation and Review ===
Two samples, each containing 1000 images, are sampled from ImageNet and ImageNetV2, respectively, They are sampled in a class-balanced manner and shuffled together. Over 28 days, all five participants labeled all images. They spent a median of 26 seconds per image. After labeling is completed, an additional multi-label annotation session was conducted, in which human predictions for all images are manually reviewed. Comparing to the initial round of labeling, 37% of the labels changes due to participants' greater familiarity with the classes.

== Main Results ==
[[File:Evaluating Machine Accuracy on ImageNet Figure 1.png | center]]

<div align="center">Figure 1</div>

===Comparison of Human and Machine Accuracies on Image Net===
From Figure 1, we can see that the difference in accuracies between the datasets is within 1% for all human participants. As hypothesized, human testers indeed performed better than the automated models on both datasets. It's worth noticing that labelers D and E, who did not participate in the initial annotation period, actually performed better than the best automated model.
===Comparison of Human and Machine Accuracies on Image Net===
Based on the results shown in Figure 1, we can see that the confidence interval of the best 4 human participants and 4 best model overlap; however, with a p-value of 0.037 using the McNemar's paired test, it rejects the hypothesis that the FixResNeXt model and Human E labeler have the same accuracy with respect to the ImageNet validation dataset. Figure 1 also shows that the confidence intervals of the labeling accuracies for human labelers C, D, E do not overlap with the confidence interval of the best model with respect to ImageNet-V2 and with the McNemar's test yielding a p-value of <math>2\times 10^{-4}</math>, it is clear that the hypothesis human and machined models have same robustness to model distribution shifts ought to be rejected.

== Other Observations ==

[[File: Results_Summary_Table.png| 800px|center]]

=== Difficult Images ===

The experiment also shed some light on images that are difficult to label. 10 images were misclassified by all of the human labelers. Among those 10 images, there was 1 image of a monkey and 9 of dogs. In addition, 27 images, with 19 in object classes and 8 in organism classes, were misclassified by all 72 machine learning models in this experiment. Only 2 images were labeled wrong by all human labelers and models. Both images contained dogs. Researchers also noted that difficult images for models are mostly images of objects and exclusively images of animals for human labelers.

=== Accuracies without dogs ===

As previously discussed in the paper, machine learning models tend to outperform human labelers when classifying the 118 dog classes. To better understand to what extent does models outperform human labelers, researchers computed the accuracies again by excluding all the dog classes. Results showed a 0.6% increase in accuracy on the ImageNet images using the best model and a 1.1% increase on the ImageNet V2 images. In comparison, the mean increases in accuracy for human labelers are 1.9% and 1.8% on the ImageNet and ImageNet V2 images respectively. Researchers also conducted a simulation to demonstrate that the increase in human labeling accuracy on non-dog images is significant. This simulation was done by bootstrapping to estimate the changes in accuracy when only using data for the non-dog classes, and simulation results show smaller increases than in the experiment.

In conclusion, it's more difficult for human labelers to classify images with dogs than it is for machine learning models.

=== Accuracies on objects ===
Researchers also computed machine and human labelers' accuracies on a subset of data with only objects, as opposed to organisms, to better illustrate the differences in performance. This test involved 590 object classes. As shown in the table above, there is a 3.3% and 3.4% increase in mean accuracies for human labelers on the ImageNet and ImageNet V2 images. In contrast, there is a 0.5% decrease in accuracy for the best model on both ImageNet and ImageNet V2. This indicates that human labelers are much better at classifying objects than these models are.

=== Accuracies on fast images ===
Unlike the CNN models, human labelers spent different amounts of time on different images, spanning from several seconds to 40 minutes. To further analyze the images that take human labelers less time to classify, researchers took a subset of images with median labeling time spent by human labelers of at most 60 seconds. These images were referred to as "fast images". There are 756 and 714 fast images from ImageNet and ImageNet V2 respectively, out of the total 2000 images used for evaluation. Accuracies of models and humans on the fast images increased significantly, especially for humans.

This result suggests that human labelers know when an image is difficult to label and would spend more time on it. It also shows that the models are more likely to correctly label images that human labelers can label relatively quickly.

== Related Work ==

=== Human accuracy on ImageNet ===

Russakovsky et al. (2015) studied two trained human labelers' accuracies on 1500 and 258 images in the context of the ImageNet challenge. The top-5 accuracy of the labeler who labeled 1500 images was the well-known human baseline on ImageNet.

As introduced before, the researchers went beyond by using multi-label accuracy, using more labelers, and focusing on robustness to small distribution shifts. Although the researchers had some different findings, some results are also consistent with results from (Russakovsky et al., 2015). An example is that both experiments indicated that it takes human labelers around one minute to label an image. The time distribution also has a long tail, due to the difficult images as mentioned before.

=== Human performance in computer vision broadly ===
There are many examples of recent studies about humans in the area of computer vision, such as investigating human robustness to synthetic distribution change (Geirhos et al., 2017) and studying what characteristics do humans use to recognize objects (Geirhos et al., 2018). Other examples include the adversarial examples constructed to fool both machines and time-limited humans (Elsayed et al., 2018) and illustrating foreground/background objects' effects on human and machine performance (Zhu et al., 2016).

=== Multi-label annotations ===
Stock & Cissé (2017) also studied ImageNet's multi-label nature, which aligns with the researchers' study in this paper. According to Stock & Cissé (2017), the top-1 accuracy measure could underestimate multi-label by up to 13.2%.

=== ImageNet inconsistencies and label error ===
Researches have found and recorded some incorrectly labeled images from ImageNet and ImageNet V2 during this study. Earlier studies (Van Horn et al., 2015) also shown that at least 4% of the birds in ImageNet are misclassified. This work also noted that the inconsistent taxonomic structure in birds' classes could lead to weak class boundaries. Researchers also noted that the majority of the fine-grained organism classes also had similar taxonomic issues.

=== Distribution shift ===
There has been an increasing amount of studies in this area. One focus of the studies is distributionally robust optimization (DRO), which finds the model that has the smallest worst-case expected error over a set of probability distributions. Another focus is on finding the model with the lowest error rates on adversarial examples. Work in both areas has been productive, but none was shown to resolve the drop in accuracies between ImageNet and ImageNet V2. A recent [https://papers.nips.cc/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf paper] also discusses quantifying uncertainty under a distribution shift, in other words whether the output of probabilistic deep learning models should or should not be trusted.

== Conclusion and Future Work ==

=== Conclusion ===
Researchers noted that in order to achieve truly reliable machine learning, researchers need a deeper understanding of the range of parameters where the model still remain robust. Techniques from Combinatorics and sensitivity analysis, in particular, might yield fruitful results. This study has provided valuable insights into the desired robustness properties by comparing model performance to human performance. This is especially evident given the results of the experiment which show humans drastically outperforming machine learning in many cases and proposes the question of how much accuracy one is willing to give up in exchange for efficiency. The results have shown that current performance benchmarks are not addressing the robustness to small and natural distribution shifts, which are easily handled by humans.

=== Future work ===
Other than improving the robustness of models, researchers should consider investigating if less-trained human labelers can achieve a similar level of robustness to distributional shifts. In addition, researchers can study the robustness to temporal changes, which is another form of natural distribution shift (Gu et al., 2019; Shankar et al., 2019). Also, Convolutional Neural Network can be a candidate to improve the accuracy of classifying images.

== Critiques ==

# Table 1 simply showed a difference in ImageNet multi-label accuracy yet does not give an explicit reason as to why such a difference is present. Although the paper suggested the distribution shift has caused the difference, it does not give other factors to concretely explain why the distribution shift was the cause.
# With the recommendation to future machine evaluations, the paper proposed to "Report performances on dogs, other animals, and inanimate objects separately.". Despite its intentions, it is narrowly specific and requires further generalization for it to be convincing.
# With choosing human subjects as samplers, no further information was given as to how they are chosen nor there are any background information was given. As it is a classification problem involving many classes as specific to species, a biology student would give far more accurate results than a computer science student or a math student.
# As explaining the importance of multi-label metrics using comparison to Top-5 metric, the turtle example falls within the overall similarity (simony) classification of the multi-label evaluation metric, as such, if the Top-5 evaluation suggests any one of the turtle species were selected, the algorithm is considered to produce a correct prediction which is the intention. The example does not convey the necessity of changing to the proposed metric over the Top-5 metric.
# With the definition in the paper regarding multi-label metrics, it is hard to see why expanding the label set is different from a traditional Top-5 metric or rather necessary, ergo does not yield the claim which the proposed metric is necessary for rigorous accuracy evaluation on ImageNet.
# When discussing the main results, the paper discusses the hypothesis on distribution shift having no effects on human and machine model accuracies; the presentation is poor at best with no clear centric to what they are trying to convey to how (in detail) they resulted in such claims.
# In the experiment setup of the presentation, there are a lot of key terms without detailed description. For example, Human labeler training using a subset of the remaining 30,000 unannotated images in the ImageNet validation set, labelers A, B, C, D, and E underwent extensive training to understand the intricacies of fine-grained class distinctions in the ImageNet class hierarchy. Authors should clarify each key term in the presentation otherwise readers are hard to follow.
# Not sure how the human samplers were determined and simply picking several people will have really high bias because the sample is too small and they have different background which will definitely affect the results a lot. Also, it will be better if there are more comparisons between the model introduced and other models.
# Given the low amount of human participants, it is hard to take the results seriously (there is too much variance). Also it's not exactly clear how the authors determined that the multi-label accuracy metric measures a semantically more meaningful notion of accuracy compared to its counterparts. For example, one of the issues with top-5 accuracy that they mention is: "For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations." But it's not clear how multi-label accuracy would be better in this instance.
# It is unclear how well the human labeler can perform labeling after training. So the final result is not that trust-worthy.
# In this experiment set up, label annotators are the same as participants of the experiments. Even if there's a break between the annotating and evaluating human labeler evaluation, the impact of the break in reducing bias is not clear. One potential human labeling data is google's "I'm not a robot" verification test. One variation of the verification test asks users to select all the photos from 9 images that are related to a certain keyword. This allows for a more accurate measurement of human performance vs ImageNet performance. In addition, it's going to reduce the biases from the small number of experiment participants.

Surround Vehicle Motion Prediction

2020-12-03T20:00:04Z

D5cui: /* Motion predictor */

DROCC: '''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections was described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.

== Previous Work ==
The autonomous vehicle trajectory approaches previously used motion models like Constant Velocity and Constant Acceleration. These models are linear and are only able to handle straight motions. There are curvilinear models such as Constant Turn Rate and Velocity and Constant Turn Rate and Acceleration which handle rotations and more complex motions. Together with these models, Kalman Filter is used to predicting the vehicle trajectory. Kalman filtering is a common technique used in sensor fusion for state estimation that allows the vehicle's state to be predicted while taking into account the uncertainty associated with inputs and measurements. However, the performance of the Kalman Filter in predicting multi-step problems is not that good. Recurrent Neural Network performs significantly better than it.

There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models.

Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].

== Motivation ==
Research results indicate that little research has been dedicated on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behaviour at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection. The following image gives a visual representation of the model.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Sensor Outputs ===

The input of the target perceptions is from the output of the sensors. The data collected in this article uses 6 different sensors with feature fusion to detect traffic in the range up to 100m: 1) LiDAR system outputs: Relative position, heading, velocity, and box size in local coordinates; 2) Around0View Monitoring (AVM) and 3)GPS outputs: acquire lanes, road marker, global position; 4) Gateway engine outputs: precise global position in urban road environment; 5) Micro-Autobox II and 6) a MDPS are used to control and actuate the subject. All data are stored in an industrial PC.

=== Data ===
Multi-lane turn intersections are the target roads in this paper. The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.

=== Motion predictor ===
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement, which is the sequential previous motion. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading, and speed.

==== Encoder and decoder ====
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit.
==== Sequence length ====
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\
&||\mu(k+1|t) - \mu(k|t)|| \leq S
\end{split}
\end{equation*}
Where <math>u_{min}</math>, <math>u_{max}</math>and S are the minimum/maximum control input and maximum slew rate of input respectively.

Determine the position and speed boundary based on the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with the results from three base algorithms, a path-following model with
constant velocity, a path-following model with traffic flow and a CTRV model.

We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>,
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}
\end{split}
\end{equation*}
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center>

The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean,
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers'
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better, by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
the vehicle, even when the target vehicle was not following the intersection guideline.

==== Statistical analysis of motion planning application results ====
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance, and had little influence on determining
the behaviour of the subject vehicle.

<center>[[Image:Figure11_YanYu.png|500px|]]</center>

In order to compare the similarities between the results form the proposed algorithm and human driving decisions,
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed
model is efficient and safe.

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm. The evaluation results showed precise prediction accuracy. In addition, the time-to-recognize in-lane targets within the intersection improved significantly over the performance of the base algorithms. The proposed algorithm was compared with human driving data, and it showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unconstructed environments, such as multi-lane turn intersections.

== Future works ==
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.

This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.

Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.

There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.

It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.

The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.

It would be really helpful if some visualization or data summary can be provided to understand the content, such as the track of the car movement.

The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.

Understandably, a supervised learning problem should be evaluated on some test dataset. However, supervised learning techniques are inherently ill-suited for general planning problems. The test dataset was obtained from human driving data which is known to be extremely noisy as well as unpredictable when it comes to motion planning. It would be crucial to determine the successes of this paper based on the state-of-the-art reinforcement learning techniques.

It would be better if the authors compared their method against other SOTA methods. Also one of the reasons motion planning is done using interpretable methods rather than black boxes (such as this model) is because it is hard to see where things go wrong and fix problems with the black box when they occur - this is something the authors should have also discussed.

A future area of study is to combine other source of information such as signals from Lidar or car side cameras to make a better prediction model.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].

[6]Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.

Neural Speed Reading via Skim-RNN

2020-12-01T14:50:39Z

D5cui: /* Results */

== Group ==

Mingyan Dai, Jerry Huang, Daniel Jiang

== Introduction ==

Recurrent Neural Network (RNN) is a class of artificial neural networks where the connection between nodes form a directed graph along with time series and has time dynamic behavior. RNN is derived from a feedforward neural network and can use its memory to process variable-length input sequences. This makes it suitable for tasks such as unsegmented, connected handwriting recognition, and speech recognition.

In Natural Language Processing, Recurrent Neural Network (RNN) is a common architecture used to sequentially ‘read’ input tokens and output a distributed representation for each token. By recurrently updating the hidden state of the neural network, an RNN can inherently require the same computational cost across time. However, when it comes to processing input tokens, some tokens, when being compared to others, are less important to the overall representation of a piece of text or query. For example, in the application of RNN to the question answering problem, it is not uncommon to encounter parts of a passage that are irrelevant to answering the query.

LSTM-Jump (Yu et al., 2017), a variant of LSTMs, was introduced to improve efficiency by skipping multiple tokens at a given step. Skip-RNN skips multiple steps and jumps based on RNNs’ current hidden state. In contrast, Skim-RNN takes advantage of 'skimming' rather than 'skipping tokens'. Skim-RNN predicts each word as important or unimportant. A small RNN is used if the word is not important, and a large RNN is used if the word is important. This paper demonstrates that skimming achieves higher accuracy compared to skipping tokens, implying that paying attention to unimportant tokens is better than completely ignoring them.

== Model ==

In this paper, the authors introduce a model called 'skim-RNN', which takes advantage of ‘skimming’ less important tokens or pieces of text rather than ‘skipping’ them entirely. This models the human ability to skim through passages, or to spend less time reading parts that do not affect the reader’s main objective. While this leads to a loss in the comprehension rate of the text [1], it greatly reduces the amount of time spent reading by not focusing on areas that will not significantly affect efficiency when it comes to the reader's objective.

'Skim-RNN' works by rapidly determining the significance of each input and spending less time processing unimportant input tokens by using a smaller RNN to update only a fraction of the hidden state. When the decision is to ‘fully read’, that is to not skim the text, Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function (‘skim’ or ‘read’) is non-differentiable, the authors use a gumbel-softmax [2] to estimate the gradient of the function, rather than traditional methods such as REINFORCE (policy gradient)[3]. Gumbel-softmax provides a differentiable sample compared to the non-differentiable sample of a categorical distribution [2]. The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of float operations (Flop reduction, or Flop-R). A high skimming rate often leads to faster inference on CPUs, which makes it very useful for large-scale products and small devices.

The Skim-RNN has the same input and output interfaces as standard RNNs, so it can be conveniently used to speed up RNNs in existing models. In addition, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting a parameter for the threshold for the ‘skim’ decision.

=== Related Works ===

As the popularity of neural networks has grown, significant attention has been given to make them faster and lighter. In particular, relevant work focused on reducing the computational cost of recurrent neural networks has been carried out by several other related works. For example, LSTM-Jump (You et al., 2017) [8] models aim to speed up run times by skipping certain input tokens, as opposed to skimming them. Choi et al. (2017)[9] proposed a model that uses a CNN-based sentence classifier to determine the most relevant sentence(s) to the question and then uses an RNN-based question-answering model. This model focuses on reducing GPU run-times (as opposed to Skim-RNN which focuses on minimizing CPU-time or Flop), and is also focused only on question answering.

=== Implementation ===

A Skim-RNN consists of two RNN cells, a default (big) RNN cell of hidden state size <math>d</math> and a small RNN cell of hidden state size <math>d'</math>, where <math>d</math> and <math>d'</math> are parameters defined by the user and <math>d' \ll d</math>. This follows the fact that there should be a small RNN cell defined for when text is meant to be skimmed and a larger one for when the text should be processed as normal.

Each RNN cell will have its own set of weights and bias as well as be any variant of an RNN. There is no requirement on how the RNN itself is structured, rather the core concept is to allow the model to dynamically make a decision as to which cell to use when processing input tokens. Note that skipping text can be incorporated by setting <math>d'</math> to 0, which means that when the input token is deemed irrelevant to a query or classification task, nothing about the information in the token is retained within the model.

Experimental results suggest that this model is faster than using a single large RNN to process all input tokens, as the smaller RNN requires fewer floating-point operations to process the token. Additionally, higher accuracy and computational efficiency are achieved.

==== Inference ====

At each time step <math>t</math>, the Skim-RNN unit takes in an input <math>{\bf x}_t \in \mathbb{R}^d</math> as well as the previous hidden state <math>{\bf h}_{t-1} \in \mathbb{R}^d</math> and outputs the new state <math>{\bf h}_t </math> (although the dimensions of the hidden state and input are the same, this process holds for different sizes as well). In the Skim-RNN, there is a hard decision that needs to be made whether to read or skim the input, although there could be potential to include options for multiple levels of skimming.

The decision to read or skim is done using a multinomial random variable <math>Q_t</math> over the probability distribution of choices <math>{\bf p}_t</math>, where

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>{\bf p}_t = \text{softmax}(\alpha({\bf x}_t, {\bf h}_{t-1})) = \text{softmax}({\bf W}[{\bf x}_t; {\bf h}_{t-1}]+{\bf b}) \in \mathbb{R}^k</math>
</div>

where <math>{\bf W} \in \mathbb{R}^{k \times 2d}</math>, <math>{\bf b} \in \mathbb{R}^{k}</math> are weights to be learned and <math>[{\bf x}_t; {\bf h}_{t-1}] \in \mathbb{R}^{2d}</math> indicates the row concatenation of the two vectors. In this case, <math> \alpha </math> can have any form as long as the complexity of calculating it is less than <math> O(d^2)</math>. Letting <math>{\bf p}^1_t</math> indicate the probability for fully reading and <math>{\bf p}^2_t</math> indicate the probability for skimming the input at time <math> t</math>, it follows that the decision to read or skim can be modelled using a random variable <math> Q_t</math> by sampling from the distribution <math>{\bf p}_t</math> and
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>Q_t \sim \text{Multinomial}({\bf p}_t)</math>
</div>

Without loss of generality, we can define <math> Q_t = 1</math> to indicate that the input will be read while <math> Q_t = 2</math> indicates that it will be skimmed. Reading requires applying the full RNN on the input as well as the previous hidden state to modify the entire hidden state while skimming only modifies part of the prior hidden state.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>
{\bf h}_t = \begin{cases}
f({\bf x}_t, {\bf h}_{t-1}) & Q_t = 1\\
[f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)] & Q_t = 2
\end{cases}
</math>
</div>

where <math> f </math> is a full RNN with output of dimension <math>d</math> and <math>f'</math> is a smaller RNN with <math>d'</math>-dimensional output. This has advantage that when the model decides to skim, then the computational complexity of that step is only <math>O(d'd)</math>, which is much smaller than <math>O(d^2)</math> due to previously defining <math> d' \ll d</math>.

==== Training ====

Since the expected loss/error of the model is a random variable that depends on the sequence of random variables <math> \{Q_t\} </math>, the loss is minimized with respect to the distribution of the variables. Defining the loss to be minimized while conditioning on a particular sequence of decisions

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>
L(\theta\vert Q)
</math>
</div>
where <math>Q=Q_1\dots Q_T</math> is a sequence of decisions of length <math>T</math>, then the expected loss over the distribution of the sequence of decisions is

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>
\mathbb{E}[L(\theta)] = \sum_{Q} L(\theta\vert Q)P(Q) = \sum_Q L(\theta\vert Q) \Pi_j {\bf p}_j^{Q_j}
</math>
</div>

Since calculating <math>\delta \mathbb{E}_{Q_t}[L(\theta)]</math> directly is rather infeasible, it is possible to approximate the gradients with a gumbel-softmax distribution [2]. Reparameterizing <math> {\bf p}_t</math> as <math> {\bf r}_t</math>, then the back-propagation can flow to <math> {\bf p}_t</math> without being blocked by <math> Q_t</math> and the approximation can arbitrarily approach <math> Q_t</math> by controlling the parameters. The reparameterized distribution is therefore

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>
{\bf r}_t^i = \frac{\text{exp}(\log({\bf p}_t^i + {g_t}^i)/\tau)}{\sum_j\text{exp}(\log({\bf p}_t^j + {g_t}^j)/\tau)}
</math>
</div>

where <math>{g_t}^i</math> is an independent sample from a <math>\text{Gumbel}(0, 1) = -\log(-\log(\text{Uniform}(0, 1))</math> random variable and <math>\tau</math> is a parameter that represents a temperature. Then it can be rewritten that

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>
{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t
</math>
</div>

where <math>{\bf \tilde{h}}_t</math> is the previous equation for <math>{\bf h}_t</math>. The temperature parameter gradually decreases with time, and <math>{\bf r}_t^i</math> becomes more discrete as it approaches 0.

A final addition to the model is to encourage skimming when possible. Therefore an extra term related to the negative log probability of skimming and the sequence length. Therefore the final loss function used for the model is denoted by

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>
L'(\theta) =L(\theta) + \gamma \cdot\frac{1}{T} \sum_i -\log({\bf \tilde{p}}^i_t)
</math>
</div>
where <math> \gamma </math> is a parameter used to control the ratio between the main loss function and the negative log probability of skimming.

== Experiment ==

The effectiveness of Skim-RNN was measured in terms of accuracy and float operation reduction on four classification tasks and a question-answering task. These tasks were chosen because they do not require one’s full attention to every detail of the text, but rather ask for capturing the high-level information (classification) or focusing on a specific portion (QA) of the text, which a common context for speed reading. The tasks themselves are listed in the table below.

[[File:Table1SkimRNN.png|center|1000px]]

=== Classification Tasks ===

In a language classification task, the input was a sequence of words and the output was the vector of categorical probabilities. Each word is embedded into a <math>d</math>-dimensional vector. We initialize the vector with GloVe [4] to form representations of the words and use those as the inputs for a long short-term memory (LSTM) architecture. A linear transformation on the last hidden state of the LSTM and then a softmax function was applied to obtain the classification probabilities. Adam [5] was used for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, <math>\tau = \max(0.5, exp(−rn))</math> where <math>r = 1e-4</math> and <math>n</math> is the global training step, following [2]. We experiment on different sizes of big LSTM (<math>d \in \{100, 200\}</math>) and small LSTM (<math>d' \in \{5, 10, 20\}</math>) and the ratio between the model loss and the skim loss (<math>\gamma\in \{0.01, 0.02\}</math>) for Skim-LSTM. The batch sizes used were 32 for SST and Rotten Tomatoes, and 128 for others. For all models, early stopping was used when the validation accuracy did not increase for 3000 global steps.

==== Results ====

[[File:Table2SkimRNN.png|center|1000px]]

[[File:Figure2SkimRNN.png|center|1000px]]

Table 2 shows the accuracy and computational cost of the Skim-RNN model compared with other standard models. It is evident that the Skim-RNN model produces a speed-up on the computational complexity of the task while maintaining a high degree of accuracy. Also, it is interesting to know that the accuracy improvement over LSTM could be due to the increased stability of the hidden state, as the majority of the hidden state is not updated when skimming. Meanwhile, figure 2 demonstrates the effect of varying the size of the small hidden state as well as the parameter <math>\gamma</math> on the accuracy and computational cost.

[[File:Table3SkimRNN.png|center|1000px]]

Table 3 shows an example of a classification task over a IMDb dataset, where Skim-RNN with <math>d = 200</math>, <math>d' = 10</math>, and <math>\gamma = 0.01</math> correctly classifies it with a high skimming rate (92%). The goal was to classify the review as either positive or negative. The black words are skimmed, and the blue words are fully read. The skimmed words are clearly irrelevant and the model learns to only carefully read the important words, such as ‘liked’, ‘dreadful’, and ‘tiresome’.

=== Question Answering Task ===

In the Stanford Question Answering Dataset, the task was to locate the answer span for a given question in a context paragraph. The effectiveness of Skim-RNN for SQuAD was evaluated using two different models: LSTM+Attention and BiDAF [6]. The first model was inspired by most then-present QA systems consisting of multiple LSTM layers and an attention mechanism. This type of model is complex enough to reach reasonable accuracy on the dataset and simple enough to run well-controlled analyses for the Skim-RNN. The second model was an open-source model designed for SQuAD, used primarily to show that Skim-RNN could replace RNN in existing complex systems.

==== Training ====

Adam was used with an initial learning rate of 0.0005. For stable training, the model was pre-trained with a standard LSTM for the first 5k steps, and then fine-tuned with Skim-LSTM.

==== Results ====

[[File:Table4SkimRNN.png|center|1000px]]

Table 4 shows the accuracy (F1 and EM) of LSTM+Attention and Skim-LSTM+Attention models as well as VCRNN [7]. It can be observed from the table that the skimming models achieve higher or similar accuracy scores compared to the non-skimming models while also reducing the computational cost by more than 1.4 times. In addition, decreasing layers (1 layer) or hidden size (<math>d=5</math>) improved the computational cost but significantly decreases the accuracy compared to skimming. The table also shows that replacing LSTM with Skim-LSTM in an existing complex model (BiDAF) stably gives reduced computational cost without losing much accuracy (only 0.2% drop from 77.3% of BiDAF to 77.1% of Sk-BiDAF with <math>\gamma = 0.001</math>).

An explanation for this trend that was given is that the model is more confident about which tokens are important in the second layer. Second, higher <math>\gamma</math> values lead to a higher skimming rate, which agrees with its intended functionality.

Figure 4 shows the F1 score of LSTM+Attention model using standard LSTM and Skim LSTM, sorted in ascending order by Flop-R (computational cost). While models tend to perform better with larger computational cost, Skim LSTM (Red) outperforms standard LSTM (Blue) with a comparable computational cost. It can also be seen that the computational cost of Skim-LSTM is more stable across different configurations and computational cost. Moreover, increasing the value of <math>\gamma</math> for Skim-LSTM gradually increases the skipping rate and Flop-R, while it also led to reduced accuracy.

=== Runtime Benchmark ===

[[File:Figure6SkimRNN.png|center|1000px]]

The details of the runtime benchmarks for LSTM and Skim-LSTM, which are used to estimate the speedup of Skim-LSTM-based models in the experiments, are also discussed. A CPU-based benchmark was assumed to be the default benchmark, which has a direct correlation with the number of float operations that can be performed per second. As mentioned previously, the speed-up results in Table 2 (as well as Figure 7) are benchmarked using Python (NumPy), instead of popular frameworks such as TensorFlow or PyTorch.

Figure 7 shows the relative speed gain of Skim-LSTM compared to standard LSTM with varying hidden state size and skim rate. NumPy was used, with the inferences run on a single thread of CPU. The ratio between the reduction of the number of float operations (Flop-R) of LSTM and Skim-LSTM was plotted, with the ratio acting as a theoretical upper bound of the speed gain on CPUs. From here, it can be noticed that there is a gap between the actual gain and the theoretical gain in speed, with the gap being larger with more overhead of the framework or more parallelization. The gap also decreases as the hidden state size increases because the overhead becomes negligible with very large matrix operations. This indicates that Skim-RNN provides greater benefits for RNNs with larger hidden state size. However, combining Skim-RNN with a CPU-based framework can lead to substantially lower latency than GPUs.

== Results ==

The results clearly indicate that the Skim-RNN model provides features that are suitable for general reading tasks, which include classification and question answering. While the tables indicate that minor losses in accuracy occasionally did result when parameters were set at specific values, they were minor and were acceptable given the improvement in runtime.

'''Controlling skim rate:''' An important advantage of Skim-RNN is that the skim rate (and thus computational cost) can be dynamically controlled at inference time by adjusting the threshold for
‘skim’ decision probability <math>{\bf p}^1_t</math>. Figure 5 shows the trade-off between the accuracy and computational cost for two settings, confirming the importance of skimming (<math>d' > 0</math>) compared to skipping (<math>d' = 0</math>).

'''Visualization:''' Figure 6 shows that the model does not skim when the input seems to be relevant to answering the question, which was as expected by the design of the model. In addition, the LSTM in the second layer skims more than that in the first layer mainly because the second layer is more confident about the importance of each token.

== Conclusion ==

A Skim-RNN can offer better latency results on a CPU compared to a standard RNN on a GPU, with lower computational cost, as demonstrated through the results of this study. Compared to RNN, Skim-RNN takes the advantage of "skimming" rather than "reading", spends less time on parts of the input that is unimportant. Future work (as stated by the authors) involves using Skim-RNN for applications that require much higher hidden state size, such as video understanding, and using multiple small RNN cells for varying degrees of skimming. Further, since it has the same input and output interface as a regular RNN, it can replace RNNs in existing applications.

== Critiques ==

1. It seems like Skim-RNN is using the not full RNN of processing words that are not important, thus it can increase speed in some very particular circumstances (ie, only small networks). The extra model complexity did slow down the speed while trying to "optimizing" the efficiency and sacrifice part of accuracy while doing so. It is only trying to target a very specific situation (classification/question-answering) and made comparisons only with the baseline LSTM model. It would be definitely more persuasive if the model can compare with some of the state of art neural network models.

2. This model of Skim-RNN is pretty good to extract binary classification type of text, thus it would be interesting for this to be applied to stock market news analysis. For example, a press release from a company can be analyzed quickly using this model and immediately give the trader a positive or negative summary of the news. Would be beneficial in trading since time and speed is an important factor when executing a trade.

3. An appropriate application for Skim-RNN could be customer service chatbots as they can analyze a customer's message and skim associated company policies to craft a response. In this circumstance, quickly analyzing text is ideal to not waste customers' time.

4. This could be applied to news apps to improve readability by highlighting important sections.

5. This summary describes an interesting and useful model that can save readers time for reading an article. I think it will be interesting that discuss more on training a model by Skim-RNN to highlight the important sections in very long textbooks. As a student, having highlights in the textbook is really helpful to study. But highlight the important parts in a time-consuming work for the author, maybe using Skim-RNN can provide a nice model to do this job.

6. Besides the good training performance of Skim-RNN, it's good to see the algorithm even performs well simply by training with CPU. It would make it possible to perform the result on lite-platforms.

7. Another good application of Skim-RNN could be in reading the terms and conditions of websites. A lot of the terms and conditions documents tend to be painfully long and majority of customers because of the length of the document tend to sign them without reading them. Since these websites can compromise the personal information of the customers by giving it to third parties, it is worthwhile for the customers to know at least what they are signing. Infact, this can also be applied to read important legal official documents like lease agreements etc.

8. Another [https://arxiv.org/abs/1904.00761 paper], written by Christian Hansen et al., also discussed Neural Speed Reading. However, conducting Neural Speed reading via a Skim-RNN, this paper suggested using a Structural-Jump-LSTM. The Structural-Jump-LSTM can both skip and jump test during inference. This model consisted of a standard LSTM with 2 agents: 1 capable of bypassing single words and another capable of bypassing punctuation.

9. Another method of increasing the speed of human reading is to quickly flash each of the words in an article in quick succession in the middle of the screen. However, it is important that relative duration each word spends on the screen is adjusted, so that a human reader will feel natural reading it. Could the skimming results from this article be incorporated into the display-one-word-at-a-time method to even further increase the speed one can read an article? Would adding color to the unskimmed words also help increase reading speed?

10. It is interesting how the paper mentions the correlation between a system’s performance in a given reading-based task and the speed of reading. This conclusion was reached by the human reading pattern with neural attention in an unsupervised learning approach.

11. With the comparison in figure 7, skim-LSTM has a definite advantage over the original LSTM model on the speed which give us an inspiration that we can boost the speed of original LSTM model used in daily life applications such as stock forecasting.

== Applications ==

Recurrent architectures are used in many other applications:

1. '''Real-time video processing''' is an exceedingly demanding and resource-constrained task, particularly in edge settings.

2. '''Music Recommend system''' is which many music platforms such as Spotify and Pandora used for recommending music based on the user's information.

3. '''Speech recognition''' enables the recognition and translation of spoken language into text by computers.

It would be interesting to see if this method could be applied to those cases for more efficient inference, such as on drones or self-driving cars. Another possible application is real-time edge processing of game video for sports arenas.

== References ==

[1] Patricia Anderson Carpenter Marcel Adam Just. The Psychology of Reading and Language Comprehension. 1987.

[2] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.

[3] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

[4] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.

[5] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[6] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017a.

[7] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.

[8] Adams Wei Yu, Hongrae Lee, and Quoc V Le. Learning to skim text. In ACL, 2017.

[9] Eunsol Choi, Daniel Hewlett, Alexandre Lacoste, Illia Polosukhin, Jakob Uszkoreit, and Jonathan Berant. Coarse-to-fine question answering for long documents. In ACL, 2017.

Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence

2020-12-01T14:02:49Z

D5cui: /* Objective Function */

== Presented by ==
Guanting(Tony) Pan, Zaiwei(Zara) Zhang, Haocheng(Beren) Chang

== Introduction ==
With the development of mobile devices and location-acquisition technologies, accessing real-time location information is becoming easier and more efficient. Precisely because of this development, Location-based Social Networks (LBSNs) like Yelp and Foursquare have become an important part of human life. People can share their experiences in locations, such as restaurants and parks, on the Internet. These locations can be seen as a Point-of-Interest (POI) in software such as Maps on our phone. These large amounts of user-POI interaction data can be used to provide a service, which is called personalized POI recommendation, to give recommendations to users of a location they might be interested in. These large amounts of data can be used to train a model through Machine Learning methods(i.e., Classification, Clustering, etc.) to predict a POI that users might be interested in. The POI recommendation system still faces some challenging issues: (1) the difficulty of modeling complex user-POI interactions from sparse implicit feedback; (2) the difficulty of incorporating geographic background information. In order to meet these challenges, this paper will introduce a novel autoencoder-based model to learn non-linear user-POI relations, which is called SAE-NAD. SAE stands for the self-attentive encoder, while NAD stands for the neighbor-aware decoder. An autoencoder is an unsupervised learning technique that we implement in the neural network model for representation learning, meaning that our neural network will contain a "bottleneck" layer that produces a compressed knowledge representation of the original input. This method will utilize machine learning knowledge that we learned in this course.

== Previous Work ==

In the previous works, the method is just equally treating users checked in POIs. The drawback of equally treating users checked in POIs is that valuable information about the similarity between users is not utilized, thus reducing the power of such recommenders. However, the SAE adaptively differentiates user preference degrees in multiple aspects.

Previous methods mainly used a process called collaborative filtering which can be divided into memory-based methods and model-based methods. Collaborative filtering makes recommendations from historical user-system interactions like user’s feedback or browsing history. Content-based and hybrid recommendation systems are also commonly used. Content-based recommendation system compares users’ information like texts, videos and images. Hybrid model combines two or more recommendation systems. Memory-based methods predict a user preference based on a weighted average of similar users or POIs. Model-based methods use user-POI data to build a model for generating recommendations. Both methods typically model user preferences linearly, which may be an oversimplification.

There are some other personalized POI recommendation methods that can be used. Some famous software (e.g., Netflix) uses model-based methods that are built on matrix factorization (MF). For example, ranked based Geographical Factorization Method in [1] adopted weighted regularized MF to serve people on POI. Machine learning is popular in this area. POI recommendation is an important topic in the domain of recommender systems [4]. This paper also described related work in Personalized location recommendation and attention mechanism in the recommendation. The recent studies on location recommendation methods using historical data (check-ins, comments, etc.)

== Motivation ==
This paper reviews encoder and decoder. A single hidden-layer autoencoder is an unsupervised neural network, which consists of two parts: an encoder and a decoder. The encoder has one activation function that maps the input data to the latent space. The decoder also has one activation function mapping the representations in the latent space to the reconstruction space. And here is the formula:

[[File: formula.png|center]](Note: a is the activation function)

The proposed method uses a two-layer neural network to compute the score matrix in the architecture of the SAE. The NAD adopts the RBF kernel to make checked-in POIs exert more influence on nearby unvisited POIs. To train this model, Network training is required.

This paper will use the datasets in the real world, which are from Gowalla[2], Foursquare [3], and Yelp[3]. These datasets would be used to train by using the method introduced in this paper and compare the performance of SAE-NAD with other POI recommendation methods. Three groups of methods are used to compare with the proposed method, which are traditional MF methods for implicit feedback, Classical POI recommendation methods, and Deep learning-based methods. Specifically, the Deep learning-based methods contain a DeepAE which is a three-hidden-layer autoencoder with a weighted loss function, we can connect this to the material in this course.

Autoencoders (AE), due to their ability to represent complex data, have become very useful in recommendation systems. The primary reason it is used is because with deep neural network and with non-linear activation function it effectively captures the non-linear and non-trivial relationships between users and POI's.

== Methodology ==

=== Notations ===

Here are the notations used in this paper. It will be helpful when trying to understand the structure and equations in the algorithm.
[[File:notations.JPG|500px|x300px|center]]

=== Structure ===

The structure of the network in this paper includes a self-attentive encoder as the input layer(yellow), and a neighbor-aware decoder as the output layer(green).

[[File:1.JPG|1200px|x600px]]

=== Self-Attentive Encoder ===

The self-attentive encoder is the input layer. It transfers the preference vector <math>x_u</math> to hidden representation <math>A_u</math> using weight matrix <math>W^1</math> and the activation function <math>softmax</math> and <math>tanh</math>. The 0's and 1's in <math>x_u</math> indicates whether the user has been to a certain POI. The weight matrix <math>W_a</math> assigns different weights on various features of POIs. <math>A_u</math> is the importance score matrix, where each column is a POI, and each row is the importance level of the <math>n</math> checked in POIs in one aspect.

[[File:encoder.JPG|center]]

<math>\mathbf{Z}_u^{(1)} = \mathbf{A}_u \cdot (\mathbf{W}^{(1)}[L_u])^\top</math> is the multiplication of the importance score matrix and the POI embeddings, and represents the user from <math>d_a</math> aspects. <math>\mathbf{Z}_u^{(1)}</math> is a <math>d_a \times H_1</math> matrix. The aggregation layer then combines multiple aspects that represent the user into one, so that it fits into the Neighbor-Aware decoder: <math>\mathbf{z}_u^{(1)} = a_t(\mathbf{Z}_u^{(1)\top}\mathbf{w}_t + \mathbf{b}_t)</math>.

=== Neighbor-Aware Decoder ===

POI recommendation uses the geographical clustering phenomenon, which increases the weight of the unvisited POIs that surround the visited POIs. Also, an aggregation layer is added to the network to aggregate users’ representations from different aspects into one aspect. This means that a person who has visited a location is very likely to return to this location again in the future, so the user is recommended POIs surrounding this area. An example would be someone who has been to the UW plaza and bought Lazeez, is very likely to return to the plaza, therefore the person is recommended to try Mr. Panino's Beijing House.

[[File:decoder.JPG|center]]

=== Objective Function ===

By minimizing the objective function, the partial derivatives with respect to all the parameters can be computed by gradient descent with backpropagation. After that, the training is complete. By minimizing the objective function, the partial derivatives with respect to all the parameters can be computed by gradient descent with backpropagation.

[[File:objective_function.JPG|center]]

== Comparative analysis ==

=== Metrics introduction ===
To obtain a comprehensive evaluation of the effectiveness of the model, the authors performed a thorough comparison between the proposed model and the existing major POI recommendation methods. These methods can be further broken down into three categories: traditional matrix factorization methods for implicit feedback, classical POI recommendation methods, and deep learning-based methods. Here, three key evaluation metrics were introduced as Precison@k, Recall@k, and MAP@k. Through comparing all models on three datasets using the above metrics, it is concluded that the proposed model achieved the best performance.

To better understand the comparison results, it is critical to understand the meanings behind each evaluation metric. Suppose the proposed model generated k recommended POIs for the user. The first metric, Precison@k, measures the percentage of the recommended POIs which the user has visited. Recall@k is also associated with the user’s behavior. However, it will measure the percentage of recommended POIs in all POIs which have been visited by the user. Lastly, MAP@k represents the mean average precision at k, where average precision is the average of precision values at all k ranks, where relevant POIs are found. They are formally defined as follows,
$$ Recall@k=\frac{1}{M} \sum_{i=1}^M \frac{S_i(k) \cap T_i}{|T_i|} \\
Precision@k = \frac{1}{M} \sum_{i=1}^M \frac{S_i(k) \cap T_i}{k} \\
MAP@k = \frac{1}{M} \sum_{i=1}^M \frac{\sum_{j=1}^k p(j) \times rel(j)}{|T_i|} \\$$

where <math>S_i(k) </math> us a set of top- <math>k </math> unvisited locations recommended to user <math> i </math> excluding those locations in the training, and <math>T_i </math> is a set of locations that are visited by user <math> i</math> in the testing. <math> p(j)</math> is the precision of a cut-off rank list from 1 to <math> j</math>, and <math>rel(j) </math> is an indicator function that equals to 1 if the location is visited in the testing, otherwise equals to 0.

=== Model Comparison ===
Among all models in the comparison group, RankGeoFM, IRNMF, and PACE produced the best results. Nonetheless, these models are still incomparable to our proposed model. The reasons are explained in details as follows:

Both RankGeoFM and IRNMF incorporate geographical influence into their ranking models, which is significant for generating POI recommendations. However, they are not capable of capturing non-linear interactions between users and POIs. In comparison, the proposed model, while incorporating geographical influence, adopts a deep neural structure which enables it to measure non-linear and complex interactions. As a result, it outperforms the two methods in the comparison group.

Moreover, compared to PACE, which is a deep learning-based method, the proposed model offers a more precise measurement of geographical influence. Though PACE is able to capture complex interactions, it models the geographical influence by a context graph, which fails to incorporate user reachability into the modeling process. In contrast, the proposed model is able to capture geographical influence directly through its neighbor-aware decoder, which allows it to achieve better performance than the PACE model.

[[File:model_comparison.JPG|center]]

== Conclusion ==
In summary, the proposed model, namely SAE-NAD, clearly showed its advantages compared to many state-of-the-art baseline methods. Its self-attentive encoder effectively discriminates user preferences on check-in POIs, and its neighbor-aware decoder measures geographical influence precisely through differentiating user reachability on unvisited POIs. By leveraging these two components together, it can generate recommendations that are highly relevant to its users.

== Critiques ==
Besides developing the model and conducting a detailed analysis, the authors also did very well in constructing this paper. The paper is well-written and has a highly logical structure. Definitions, notations, and metrics are introduced and explained clearly, which enables readers to follow through the analysis easily. Last but not least, both the abstract and the conclusion of this paper are strong. The abstract concisely reported the objectives and outcomes of the experiment, whereas the conclusion is succinct and precise.

This idea would have many applications, such as suggesting new restaurants to customers in the food delivery service app. Would the improvement in accuracy outweigh the increased complexity of the model when it comes to use in industry?

It would be nice if the authours could describe the extensive experiments on the real-world datasets, with different baseline methods and evaluation metrics, to demonstrate the effectiveness of the proposed model. Moreover, show the comparison result in tables vs other methodologies, both in terms of accuracy and time-efficiency. In addition, the drawbacks of this new methodology are unknown to the readers. In other words, how does this compare to the already established recommendation systems found in large scale applications utilize by companies like Netflix? Why use the proposed method over something simpler such as matrix factorization or collaborative filtering?

It would also be nice if the authors provided some more ablation on the various components of the proposed method. Even after reading some of their experiments, we do not have a clear understanding of how important each component is to the recommendation quality.

It is recommended that the author present how the encoder and the decoder help in the process by comparing different models with or without encoder and decoder.

It would have been better if the author included all figures to explain the sensitivity of parameters more elaborately. In the paper he only included plots showing effects on Gowalla and Foursquare datasets but not other datasets.

Some additional explanations can be inserted in the Objective Function part. From the paper, we can find <math>\lambda</math> is the regularization parameter and <math>W_a</math> and <math>w_t</math> are the learned parameters in the attention layer and aggregation layer.

It would be more attractive if there is a section to introduce the applications that based on such algorithm in daily life. For instance, which application we use nowadays is based on this algorithm and what are the advantages of it compared to other similar algorithm?

It would be useful to the readers if the authors briefly described covariates with which they are using to predict POIs. All that is detailed in this paper is that geographical information is used. Taking note of the characteristics that the data set the proposed method deals with is important, especially as phones become capabable of collecting more and more kinds of data. That is to say, the information available for predicting POI's in the future may be different to the information available today, so it is important to describe the data. (The information could even decrease in the future if privacy laws are enacted).

== References ==
[1] Defu Lian, Cong Zhao, Xing Xie, Guangzhong Sun, Enhong Chen, and Yong Rui. 2014. GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation. In KDD. ACM, 831–840.

[2] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. 2011. Friendship and mobility: user movement in location-based social networks. In KDD. ACM, 1082–1090.

[3] Yiding Liu, Tuan-Anh Pham, Gao Cong, and Quan Yuan. 2017. An Experimental Evaluation of Point-of-interest Recommendation in Location-based Social Networks. PVLDB 10, 10 (2017), 1010–1021.

[4] Jie Bao, Yu Zheng, David Wilkie, and Mohamed F. Mokbel. 2015. Recommendations in location-based social networks: a survey. GeoInformatica 19, 3 (2015), 525–565.

[5] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW. ACM, 173–182.

[6] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM. IEEE Computer Society, 263–272.

[7] Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: factored item similarity models for top-N recommender systems. In KDD. ACM, 659–667. [12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).

[8] Yong Liu,WeiWei, Aixin Sun, and Chunyan Miao. 2014. Exploiting Geographical
Neighborhood Characteristics for Location Recommendation. In CIKM. ACM,
739–748

[9] Xutao Li, Gao Cong, Xiaoli Li, Tuan-Anh Nguyen Pham, and Shonali Krishnaswamy.
2015. Rank-GeoFM: A Ranking based Geographical Factorization
Method for Point of Interest Recommendation. In SIGIR. ACM, 433–442.

[10] Carl Yang, Lanxiao Bai, Chao Zhang, Quan Yuan, and Jiawei Han. 2017. Bridging
Collaborative Filtering and Semi-Supervised Learning: A Neural Approach for
POI Recommendation. In KDD. ACM, 1245–1254.

Semantic Relation Classification——via Convolution Neural Network

2020-12-01T13:32:12Z

D5cui: /* Critiques */

== Presented by ==
Rui Gong, Xinqi Ling, Di Ma,Xuetong Wang

== Introduction ==
A Semantic Relation can imply a relation between different words and a relation between different sentences or phrases. For example, the pair of words "white" and "snowy" can be synonyms, while "white" and "black" can be opposites. It can be used for recommendation systems like YouTube, and understanding sentiment analysis. The study of semantic analysis involves determining the exact meaning of a text. For example, the word "date", can have different meanings in different contexts, like a calendar "date", a "date" fruit, or a romantic "date".

One of the emerging trends of natural language technologies is their use for the humanities and sciences (Gbor et al., 2018). SemEval 2018 Task 7 mainly solves the problem of relation extraction and classification of two entities in the same sentence into 6 potential relations. The 6 relations are USAGE, RESULT, MODEL-FEATURE, PART WHOLE, TOPIC, and COMPARE.

SemEval 2018 Task 7 extracted data from 350 scientific paper abstracts, which has 1228 and 1248 annotated sentences for two tasks, respectively. For each data, an example sentence was chosen with its right and left sentences, as well as an indicator showing whether the relation is reserved, then a prediction is made.

Three models were used for the prediction: Linear Classifiers, Long Short-Term Memory(LSTM), and Convolutional Neural Networks (CNN). Linear Classifier achieves the goal of classification by making a classification decision based on the value of a linear combination of the characteristics. LSTM is an artificial recurrent neural network (RNN) architecture well suited to classifying, processing and making predictions based on time series data. In the end, the prediction based on the CNN model was ultimately submitted since it performed the best among all models. By using the learned custom word embedding function, the research team added a variant of negative sampling, thereby improving performance and surpassing ordinary CNN.

== Previous Work ==
SemEval 2010 Task 8 (Hendrickx et al., 2010) explored the classification of natural language relations and studied the 9 relations between word pairs. However, it is not designed for scientific text analysis, and their challenge differs from the challenge of this paper in its generalizability; this paper’s relations are specific to ACL papers (e.g. MODEL-FEATURE), whereas the 2010 relations are more general, and might necessitate more common-sense knowledge than the 2018 relations. Xu et al. (2015a) and Santos et al. (2015) both applied CNN with negative sampling to finish task7. The 2017 SemEval Task 10 also featured relation extraction within scientific publications.

== Algorithm ==

[[File:CNN.png|800px|center]]

This is the architecture of CNN. We first transform a sentence via Feature embeddings. Word representations are encoded by the column vector in the embedding matrix <math> W^{word} \in \mathbb{R}^{d^w \times |V|}</math>, where <math>V</math> is the vocabulary of the dataset. Each colummn is the word embedding vector for the <math>i^{th}</math> word in the vocabulary. This matrix is trainale during the optimization process and initialized by pre-trained emmbedding vectors. Basically, we transform each sentence into continuous word embeddings:

$$
(e^{w_i})
$$

And word position embeddings:
$$
(e^{wp_i}): e_i = [e^{w_i}, e^{wp_i}]
$$

In the word embeddings, we generated a vocabulary <math> V </math>. We will then generate an embedding word matrix based on the position of the word in the vocabulary. This matrix is trainable and needs to be initialized by pre-trained embedding vectors such as through GloVe or Word2Vec.

In the word position embeddings, we first need to input some words named ‘entities,’ and they are the key for the machine to determine the sentence’s relation. During this process, if we have two entities, we will use the relative position of them in the sentence to make the
embeddings. We will output two vectors, and one of them keeps track of the first entity relative position in the sentence ( we will make the entity recorded as 0, the former word recorded as -1 and the next one 1, etc. ). And the same procedure for the second entity. Finally, we will get two vectors concatenated as the position embedding. For example, in the sentence "the black '''cat''' jumped", the position embedding of "'''cat'''" is -2,-1,0,1.

After the embeddings, the model will transform the embedded sentence into a fix-sized representation of the whole sentence via the convolution layer. Finally, after the max-pooling to reduce the dimension of the output of the layers, we will get a score for each relation class via a linear transformation.

After featurizing all words in the sentence. The sentence of length N can be expressed as a vector of length <math> N </math>, which looks like
$$e=[e_{1},e_{2},\ldots,e_{N}]$$
and each entry represents a token of the word. Also, to apply
convolutional neural network, the subsets of features
$$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$
are given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to
produce a new feature, defiend as
$$c_{i}=\text{tanh}(W\cdot e_{i:i+k-1}+bias)$$
This process is applied to all subsets of features with length <math> k </math> starting
from the first one. Then a mapped feature factor is produced:
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$

The max pooling operation is used, the <math> \hat{c}=max\{c\} </math> was picked.
With different weight filter, different mapped feature vectors can be obtained. Finally, the original
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,
then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.

Then, the score vector
$$s(x)=W^{classes}r_{x}$$
is obtained which represented the score for each class, given <math> x </math>'s entities' relation will be classified as
the one with the highest score. The <math> W^{classes} </math> here is the model being trained.

To improve the performance, “Negative Sampling" was used. Given the trained data point
<math> \tilde{x} </math>, and its correct class <math> \tilde{y} </math>. Let <math> I=Y\setminus\{\tilde{y}\} </math> represent the
incorrect labels for <math> x </math>. Basically, the distance between the correct score and the positive margin, and the negative
distance (negative margin plus the second largest score) should be minimized. So the loss function is
$$L=\log(1+e^{\gamma(m^{+}-s(x)_{y})})+\log(1+e^{\gamma(m^{-}-\mathtt{max}_{y'\in I}(s(x)_{y'}))})$$
with margins <math> m_{+} </math>, <math> m_{-} </math>, and penalty scale factor <math> \gamma </math>.
The whole training is based on ACL anthology corpus and there are 25,938 papers with 136,772,370 tokens in total,
and 49,600 of them are unique.

== Results ==
In machine learning, the most important part is to tune the hyper-parameters. Unlike traditional hyper-parameter optimization, there are some
modifications to the model to increase performance on the test set. There are 5 modifications that we can apply:

'''1.''' Merged Training Sets. It combined two training sets to increase the data set
size and it improves the equality between classes to get better predictions.

'''2.''' Reversal Indicate Features. It added a binary feature.

'''3.''' Custom ACL Embeddings. It embedded a word vector to an ACL-specific
corps.

'''4.''' Context words. Within the sentence, it varies in size on a context window
around the entity-enclosed text.

'''5.''' Ensembling. It used different early stop and random initializations to improve
the predictions.

These modifications performances well on the training data and they are shown
in table 3.

[[File:table3.PNG|center]]

As we can see the best choice for this model is ensembling as the random initialization made the data more natural and avoided the overfit.
During the training process, there are some methods such that they can only
increase the score on the cross-validation test sets but hurt the performance on
the overall macro-F1 score. Thus, these methods were eventually ruled out.

[[File:table4.PNG|center]]

There are six submissions in total. Three for each training set and the result
is shown in figure 2.

The best submission for the training set 1.1 is the third submission which does not
use a separate cross-validation dataset. Instead, a constant number of
training epochs are run with cross-validation based on the training data.

This compares to the other submissions in which 10% of the training data are
removed to form the validation set (both with and without stratification).
Predictions for these are made when the model has the highest validation accuracy.

The best submission for the training set 1.2 is the submission which
extracted 10% of the training data to form the validation dataset. Predictions are
made when maximum accuracy is reached on the validation data.

All in all, early stopping cannot always be based on the accuracy of the validation set
since it cannot guarantee to get better performance on the real test set. Thus,
we have to try new approaches and combine them to see the prediction
results. Also, doing stratification will certainly improve the performance of
the test data.

== Conclusions ==
Throughout the process, we have experimented with linear classifiers, sequential random forest, LSTM, and CNN models with various variations applied, such as two models of attention, negative sampling, entity embedding or sentence-only embedding, etc.

Among all variations, vanilla CNN with negative sampling and ACL-embedding, without attention, has significantly better performance than all others. Attention-based pooling, up-sampling, and data augmentation are also tested, but they barely perform positive increment on the behavior.

== Critiques ==

- Applying this in news apps might be beneficial to improve readability by highlighting specific important sections.

- The data set come from 350 scientist papers, this could be more explained by the author on how those paper are selected and why those paper are important to discuss.

- In the section of previous work, the author mentioned 9 natural language relationships between the word pairs. Among them, 6 potential relationships are USAGE, RESULT, MODEL-FEATURE, PART WHOLE, TOPIC, and COMPARE. It would help the readers to better understand if all 9 relationships are listed in the summary.

-This topic is interesting and this application might be helpful for some educational websites to improve their website to help readers focus on the important points. I think it will be nice to use Latex to type the equation in the sentence rather than center the equation on the next line. I think it will be interesting to discuss applying this way to other languages such as Chinese, Japanese, etc.

- It would be a good idea if the authors can provide more details regarding ACL Embeddings and Context words modifications. Scores generated using these two modifications are quite close to the highest Ensembling modification generated score, which makes it a valid consideration to examine these two modifications in detail.

- This paper is dealing with a similar problem as 'Neural Speed Reading Via Skim-RNN', num 19 paper summary. It will be an interesting approach to compare these two models' performance based on the same dataset.

- I think it would be highly practical to implement this system as a page-rank system for search engines (such as google, bing, or other platforms like Facebook, Instagram, etc.) by finding the most prevalent information available in a search query and then matching the search to the related text which can be found on webpages. This could also be implemented in search bars on specific websites or locations as well.

- It would be interesting to see in the future how the model would behave if data not already trained was used. This pre-trained data as mentioned in the paper had noise included. Using cleaner data would give better results maybe.

- The selection of the training dataset, i.e. the abstracts of scientific papers, is an excellent idea since the abstracts usually contain more information than the body. But it may be also a good idea to train the model with the conclusions. Other than that, the result of applying the model to the body part of the scientific papers may show some interesting features of the model.

- From Table 4 we find that comparing with using a fixed number of training periods, early stopping based on the accuracy of the validation set does not guarantee better test set performance. The label ratio of the validation set is layered according to the training set, which helps to improve the performance of the test set. Whether it is beneficial to add entity embedding as an additional feature could be an interesting point of discussion.

- The author mentioned the use of CNNs for contextual understanding. NLP models based on CNNs have sometimes been inadequate for the use of understanding the context of words used in a body of text, being very prone to overfitting the context of the training data. It would help if more evidence was shown as to how the paper deals with this problem.

- Deep neural network with a complex structure and huge parameter set is good at fitting the model. However, overfitting is a problem. Some strategies can be discussed like dropout strategies. Also, the choice of the most appropriate number of hidden layers is related to many factors like the scale of the training corpus. This can also be talked about in the paper.

== References ==
[1] Diederik P Kingma and Jimmy Ba. 2014. Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.

[2] DragomirR. Radev, Pradeep Muthukrishnan, Vahed
Qazvinian, and Amjad Abu-Jbara. 2013. The ACL
anthology network corpus. Language Resources
and Evaluation, pages 1–26.

[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
Dean. 2013a. Efficient estimation of word
representations in vector space. arXiv preprint
arXiv:1301.3781.

[4] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado,
and Jeff Dean. 2013b. Distributed representations
of words and phrases and their compositionality.
In Advances in neural information processing
systems, pages 3111–3119.

[5] Kata Gbor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Hafa Zargayouna,
and Thierry Charnois. 2018. Semeval-2018 task 7:Semantic relation extraction and classification in scientific papers.
In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval2018), New Orleans, LA, USA, June 2018.

Task Understanding from Confusing Multi-task Data

2020-12-01T13:05:06Z

D5cui: /* CSL-Net */

'''Presented By'''

Qianlin Song, William Loh, Junyue Bai, Phoebe Choi

= Introduction =

Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, assist doctors to make data-driven decisions, and even self-driving cars. One of the most famous integrated forms of Narrow AI is Apple's Siri. Siri has no self-awareness or genuine intelligence, and hence often has challenges performing tasks outside its range of abilities. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI.

General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. For an isolated and very difficult task, the artificial intelligence may not learn it very well. For instance, a net with pixel dimension 1000*1000 is less likely to identify complicated objects in real-world situations on the time basis. However, if it could be learned simultaneously, it would be better as the tasks can share what they learned. It is easier for the learner to learn together instead of in isolation, for example, shapes, landmarks, textures, orientation and so on. This is called Multitask Learning. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste.

[[File:CSLFigure1.PNG | 500px]]

Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations.

This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The first function allocates an input to its respective task and the latter function maps the input to its label within the allocated tasks. See figure 1(b). To implement the CSL, we use two neural networks to represent the de-confusing function and mapping function respectively. However, simply combining the two functions or networks to a single architecture is impossible, since the one-hot constraint of the outputs for the de-confusing network makes the gradient back-propagation unfeasible. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization in the proposed architecture CLS-Net.

Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine assigned with complete information.

= Related Work =

[[File:CSLFigure2.PNG | 700px]]

==Latent variable learning==
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, we know that samples are generated from multiple distinct distributions instead of one distribution combining a mixture of multiple probability models. Thus, the latent variable learning can not fully distinguish labels into different tasks and different distributions, and it is insufficient to classify the multi-task confusing samples.

==Multi-task learning==
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved the overall learning efficiency, since the labels in different tasks are often correlated: improving the classfication result for one class also help with other classification tasks. In multi-task learning, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. If such manuual task annotation is abstent, then the algorithm can not be performed.

==Multi-label learning==
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.

= Confusing Supervised Learning =

== Description of the Problem ==

Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, let <math> (x,y)</math> be the training samples from <math>y=f(x)</math>, which is an identical but unknown mapping relationship. Assuming the risk measure is mean squared error (MSE), the expected risk function is

$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$

where <math>p(x)</math> is the data distribution of the input variable <math>x</math>. In practice, the methods select the optimal function by minimizing the empirical risk:

$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$

To minimize the risk function, the theoretically optimal solution is <math> f(x) </math>.

When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk function can be modified to fit this new task for traditional supervised learning methods.

$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$

We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math>. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. The solution represents a mixed probably model instead of knowing the exact tasks and their correpsonding individual probability distribution. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.

== Learning Functions of CSL ==

To overcome this issue, the authors introduce two types of learning functions:
* '''Deconfusing function''' — allocation of which samples come from the same task
* '''Mapping function''' — mapping relation from input to the output of every learned task

Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (using MSE loss) is

$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$

which can be estimated empirically with

$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$

The risk metric of every sample affects only its assigned task.

== Theoretical Results ==

This novel framework yields some theoretical results to show the viability of its construction.

'''Theorem 1 (Existence of Solution)'''
''With the confusing supervised learning framework, there is an optimal solution''
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$

$$g_k^*(x) = f_k(x)$$

''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''

However, necessity constraints are needed to avoid meaningless trivial solutions in all optimal risk solutions.

'''Theorem 2 (Error Bound of CSL)'''
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by

$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$

''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$

This theorem shows the method of empirical risk minimization is valid in the CSL framework. Moreover, the assumed number of tasks affects the VC dimension of the learning functions, which is positively related to the generalization error. Therefore, to make the training risk small, we need to choose the ''minimum number'' of tasks when determining the task.

= CSL-Net =
In this section, the authors describe how to implement and train a network for CSL, including the stucture of CSL-Net and Iterative deconfusing algorithm.

== The Structure of CSL-Net ==
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$

The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.

== Iterative Deconfusing Algorithm ==
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks.

'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.

'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.

The two optimization stages are carried out alternately until the solution converges.

=Experiment=
==Setup==

3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks.

'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to reproduce both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks.

'''Colorful-MNIST''': The first image classification data set consists of digit data in a range of 0 to 9, each of which is in a single color among the eight different colors. Each observation in this modified set consists of a colored image (<math>x_i</math>) and a label (<math>y_i</math>) that represents either the corresponding color, or the digit. The goal is to reproduce the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks.

'''Kaggle Fashion Product''': The second image classification data set consists of several fashion-related objects labeled from any of the 3 criteria: “gender”, “category”, and “main color”, whose number of observations is larger than that of the "colored-MNIST" data set.

==Use of Pre-Trained CNN Feature Layers==

In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks. The CSL methods autonomously learned three tasks which corresponded exactly to “Gender”,
“Category”, and “Color” as we see it.

==Metrics of Confusing Supervised Learning==

There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function.

'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.

$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$

The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.

'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.

$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$

The purpose of this measure arises from the fact that, in addition to learning mapping allocations like humans, machines should be able to approximate all mapping functions accurately in order to provide corresponding labels. The Label Prediction Accuracy measure captures the exchange equivalence of the following task: each mapping contains its ground-truth output, and machines should be predicting the correct output that is close to the ground-truth.

==Results==

Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.

'''Function Regression''': To "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well in learning the ground-truth. That means the initialization of the neural network is set up properly.

'''Image Classification''': Visualizations created through Spectral embedding confirm the task labelling proficiency of the deconfusing neural network <math>h</math>.

The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.

==Application of Multi-label Learning==

CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.

Applications of multi-label classification include building a recommendation system, social media targeting, as well as detecting adverse drug reactions from the text.

Multi-label can be used to improve the syndrome diagnosis of a patient by focusing on multiple syndromes instead of a single syndrome.

==Limitations==

'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.

'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.

= Conclusion =

This paper proposes the CSL method for tackling the multi-task learning problem without manual task annotations from basic input data. The model obtains a basic task concept by learning the minimum risk for confusing samples from differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.

However, some limitations can be improved for future work:

- The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process;

- The current algorithm is difficult to learn basic features directly through a CNN structure and understand tasks simultaneously by training a full-connect network. However, this limitation does not affect the effectiveness of our algorithm in learning confusing data based on pre-trained features.

= Critique =

The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.

Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.

Compared to the standard supervised learning, Multi-label learning can associate a training sample with multiple category tags at the same time. It can assign multiple labels to some hidden instances and can be reduced to standard supervised learning by limiting the number of class labels per instance.

This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.

This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed to converge because of the strong correlation in prediction between the two networks.

- It would be interesting to see this implemented in more examples, to test the robustness of different types of data. The validation tasks chosen by data are all very simple, and CSL is actually not necessary for those tasks. For the colored MNIST data, a simple function can be written to distinguish the color label from the number label. The same problem applied to the Kaggle Fashion product dataset. The candidate label can be easily classified into different tasks by some wording analysis or meaning classification program or even manual classification. Even though the idea discussed by authors are interesting, the examples suggested by authors seem to suggest very limited or even unnessary application. In most cases, it is more benefitial to treat the Confusing Multi-task Data problems separately into two distinct stages: we classify the tasks first according to the meaning of the label, and then we perform a multi-class/multi-label training process.

Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.

When using this framework for classification, the order of the one-hot classification labels for each task will likely influence the relationships learned between each task, since the same output header is used for all tasks. This may be why this method fails to learn low-level representations and requires pretraining. I would like to see more explanation in the paper about why this isn't a problem if it was investigated.

It would be a good idea to include comparison details in the summary to make the results and the conclusion more convincing. For instance, though the paper introduced the result generated using confusion data, and provide some applications for multi-label learning, these two sections still fell short and could use some technical details as supporting evidence.

It is interesting to investigate if the order of adding tasks will influence the model performance.

It would be interesting to see the effectiveness of applying CSL in face recognition, such that not only does the algorithm map the face to identity, it also categorizes the face based on other features like beard/no beard and glasses/no glasses simultaneously.

For pattern recognition,pre-trained features were used in the algorithm. It would be interesting to see how the effectiveness of the model changes if we train it with data directly from the CNN structure in the future.

So basically given a confused dataset CSL finds the important tasks or labels from the dataset as can be seen from the fruit example. In the example, fruits are grouped under their names, their tastes, and their color, when CSL is given a mixed dataset. Hence given an unstructured data, unlabeled, confused dataset CSL helps in finding the labels, which in turn can help in cleaning the dataset and further in preparing high-quality training data set which is very important in different ML algorithms. Since at present preparing these dataset requires manual data annotations, CSL can save time in that process.

For the Colorful-Mnist data set, the goal is to understand the concept of multiple classification tasks from these examples. All inputs have multiple classification tasks. Each observed sample only represents the classification result of one task, and the task from which the sample comes is unknown.

It would be nice to know why the given metrics of confusing supervised learning are used. The authors should have used several different metrics and show that CSL's overall performs better than other methods. And what are "the other methods" referring to?

In the Iterative Deconfusing Algorithmn section, the Training of Mapping-Net needs more explanation. The authors should specify what it is doing before showing its equations.

For the results section, it would be more intuitive and stronger if the author provide more detail on these two methods and add a plot to support the claim. Based on the text, it might not be an obvious comparison.

= References =

[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."

[2] Caruana, R. (1997) "Multi-task learning"

[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8.

[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.

[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.

[6] Guo-Ping Liu, Jian-Jun Yan, Yi-Qin Wang, Jing-Jing Fu, Zhao-Xia Xu, Rui Guo, Peng Qian, "Application of Multilabel Learning Using the Relevant Feature for Each Label in Chronic Gastritis Syndrome Diagnosis", Evidence-Based Complementary and Alternative Medicine, vol. 2012, Article ID 135387, 9 pages, 2012. https://doi.org/10.1155/2012/135387

stat441F21

2020-12-01T07:43:14Z

D5cui:

== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==



= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=

Use the following notations:

P: You have written a summary/critique of the paper.

T: You had a technical contribution on a paper (excluding the paper that you present).

E: You had an editorial contribution on a paper (excluding the paper that you present).

=Paper presentation=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="250pt"|Name
|width="15pt"|Paper number
|width="700pt"|Title
|width="15pt"|Link to the paper
|width="30pt"|Link to the summary
|width="30pt"|Link to the video
|-
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]
|-
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||
[https://www.youtube.com/watch?v=TVLpSFYgF0c&feature=youtu.be]
|-
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Influenza_Forecasting_Framework_based_on_Gaussian_Processes Summary]|| [https://www.youtube.com/watch?v=HZG9RAHhpXc&feature=youtu.be]
|-
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin Summary] || [https://learn.uwaterloo.ca/d2l/ext/rp/577051/lti/framedlaunch/6ec1ebea-5547-46a2-9e4f-e3dc9d79fd54]
|-
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Streaming_Bayesian_Inference_for_Crowdsourced_Classification Summary] || [https://www.youtube.com/watch?v=UgVRzi9Ubws]
|-
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| Neural Ordinary Differential Equations || [https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_ODEs Summary]||
|-
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| Adversarial Attacks on Copyright Detection Systems || Paper [https://proceedings.icml.cc/static/paper_files/icml/2020/1894-Paper.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems Summary] || [https://www.youtube.com/watch?v=bQI9S6bCo8o]
|-
|Week of Nov 16 || Casey De Vera, Solaiman Jawad || 7|| IPBoost – Non-Convex Boosting via Integer Programming || [https://arxiv.org/pdf/2002.04679.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=IPBoost Summary] || [https://www.youtube.com/watch?v=4DhJDGC4pyI&feature=youtu.be]
|-
|Week of Nov 16 || Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran || 8|| What Game Are We Playing? End-to-end Learning in Normal and Extensive Form Games || [https://arxiv.org/pdf/1805.02777.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=what_game_are_we_playing Summary] || [https://www.youtube.com/watch?v=9qJoVxo3hnI&feature=youtu.be]
|-
|Week of Nov 16 || Yuchuan Wu || 9|| || || ||
|-
|Week of Nov 16 || Zhou Zeping, Siqi Li, Yuqin Fang, Fu Rao || 10|| A survey of neural network-based cancer prediction models from microarray data || [https://www.sciencedirect.com/science/article/pii/S0933365717305067 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Y93fang Summary] || [https://youtu.be/B8pPUU8ypO0]
|-
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou Summary] || [https://www.youtube.com/watch?v=tvCEvvy54X8&ab_channel=JJLian]
|-
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Networks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat Summary] || [https://www.youtube.com/watch?v=or5RTxDnZDo]
|-
|Week of Nov 23 || Taohao Wang, Zeren Shen, Zihao Guo, Rui Chen || 13|| Large Scale Landmark Recognition via Deep Metric Learning || [https://arxiv.org/pdf/1908.10192.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang Summary] || [https://www.youtube.com/watch?v=K9NypDNPLJo Video]
|-
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data Summary] || [https://youtu.be/i_5PQdfuH-Y]
|-
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| Semantic Relation Classification via Convolution Neural Network|| [https://www.aclweb.org/anthology/S18-1127.pdf paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification——via_Convolution_Neural_Network Summary]|| [https://www.youtube.com/watch?v=m9o3NuMUKkU&ab_channel=DiMa video]
|-
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| Graph Structure of Neural Networks || [https://proceedings.icml.cc/paper/2020/file/757b505cfd34c64c85ca5b5690ee5293-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_Networks Summary] || [https://youtu.be/x9eUgEwntcs Video]
|-
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker Summary]|| [https://www.youtube.com/watch?v=kazqcOwbtTI Video]
|-
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Point-of-Interest_Recommendation:_Exploiting_Self-Attentive_Autoencoders_with_Neighbor-Aware_Influence Summary] || [https://www.youtube.com/watch?v=aAwjaos_Hus Video]
|-
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN Summary]|| [https://youtu.be/vOENmt9jgVE Video]
|-
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan Summary]|| [https://www.youtube.com/watch?v=48xYZXifjS0&ab_channel=SeakraChin]
|-
|Week of Nov 30 || Banno Dion, Battista Joseph, Kahn Solomon || 21|| Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks || [https://www.sciencedirect.com/science/article/pii/S1877050919310646] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Music_Recommender_System_Based_using_CRNN#Evaluation_of_Music_Recommendation_System: Summary] || [https://youtu.be/eGUV3zwLwqQ]
|-
|Week of Nov 30 || Isaac Ellmen, Dorsa Mohammadrezaei, Emilee Carson || 22|| A universal SNP and small-indel variant caller using deep neural networks||[https://www.nature.com/articles/nbt.4235.epdf?author_access_token=q4ZmzqvvcGBqTuKyKgYrQ9RgN0jAjWel9jnR3ZoTv0NuM3saQzpZk8yexjfPUhdFj4zyaA4Yvq0LWBoCYQ4B9vqPuv8e2HHy4vShDgEs8YxI_hLs9ov6Y1f_4fyS7kGZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_universal_SNP_and_small-indel_variant_caller_using_deep_neural_networks Summary] ||
|-
|Week of Nov 30 || Daniel Fagan, Cooper Brooke, Maya Perelman || 23|| Efficient kNN Classification With Different Number of Nearest Neighbors || [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7898482 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Dfagan Summary]|| [https://youtu.be/_STVyvm_Kao]
|-
|Week of Nov 30 || Karam Abuaisha, Evan Li, Jason Pu, Nicholas Vadivelu || 24|| Being Bayesian about Categorical Probability || [https://proceedings.icml.cc/static/paper_files/icml/2020/3560-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_Probability Summary] || [https://drive.google.com/file/d/1I0uYF2xEMuNVtaEhPb_vZ6bxSKMi0gUh/view?usp=sharing]
|-
|Week of Nov 30 || Anas Mahdi Will Thibault Jan Lau Jiwon Yang || 25|| Loss Function Search for Face Recognition
|| [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf] paper || Summary [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Loss_Function_Search_for_Face_Recognition] || [https://youtu.be/i3dXnK9HGSQ]
|-
|Week of Nov 30 ||Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee || 26|| Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms || [https://arxiv.org/pdf/1912.07618.pdf?fbclid=IwAR0RwATSn4CiT3qD9LuywYAbJVw8YB3nbex8Kl19OCExIa4jzWaUut3oVB0 Paper] || Summary [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&fbclid=IwAR1Tad2DAM7LT6NXXuSYDZtHHBvN0mjZtDdCOiUFFq_XwVcQxG3hU-3XcaE] || [https://www.youtube.com/watch?v=kiYbAmd_3IA]
|-
|Week of Nov 30 || Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim || 27|| Improving neural networks by preventing co-adaption of feature detectors || [https://arxiv.org/pdf/1207.0580.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectors Summary] || [https://youtu.be/SV5UOM3QwiA Video]
|-
|Week of Nov 30 || Yawen Wang, Danmeng Cui, ZiJie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Describtion_of_Text_Mining Summary] || [https://youtu.be/to1fj0GyTAg]
|-
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN Summary] || [https://youtu.be/NgcSMXNDNuU]
|-
|Week of Nov 30 || Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classifcation systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Research_Papers_Classification_System Summary] || [https://youtu.be/Ug-5H4B5xkQ]
|-
|Week of Nov 30 || Daniel Zhang, Jacky Yao, Scholar Sun, Russell Parco, Ian Cheung || 31 || Speech2Face: Learning the Face Behind a Voice || [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Speech2Face:_Learning_the_Face_Behind_a_Voice Summary] || [https://youtu.be/lNQAbMxOj4w]
|-
|Week of Nov 30 || Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du || 32 || Evaluating Machine Accuracy on ImageNet || [https://proceedings.icml.cc/static/paper_files/icml/2020/6173-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Evaluating_Machine_Accuracy_on_ImageNet Summary] || [https://youtu.be/jj4S3VGzQz4 Video]
|-
|Week of Nov 30 || Mushi Wang, Siyuan Qiu, Yan Yu || 33 || Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections || [https://ieeexplore.ieee.org/abstract/document/8957421 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction Summary] || [https://youtu.be/cqyn3aO_5tc Video 33]

stat441F21

2020-12-01T07:28:36Z

D5cui:

== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==



= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=

Use the following notations:

P: You have written a summary/critique of the paper.

T: You had a technical contribution on a paper (excluding the paper that you present).

E: You had an editorial contribution on a paper (excluding the paper that you present).

=Paper presentation=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="250pt"|Name
|width="15pt"|Paper number
|width="700pt"|Title
|width="15pt"|Link to the paper
|width="30pt"|Link to the summary
|width="30pt"|Link to the video
|-
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]
|-
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||
[https://www.youtube.com/watch?v=TVLpSFYgF0c&feature=youtu.be]
|-
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Influenza_Forecasting_Framework_based_on_Gaussian_Processes Summary]|| [https://www.youtube.com/watch?v=HZG9RAHhpXc&feature=youtu.be]
|-
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin Summary] || [https://learn.uwaterloo.ca/d2l/ext/rp/577051/lti/framedlaunch/6ec1ebea-5547-46a2-9e4f-e3dc9d79fd54]
|-
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Streaming_Bayesian_Inference_for_Crowdsourced_Classification Summary] || [https://www.youtube.com/watch?v=UgVRzi9Ubws]
|-
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| Neural Ordinary Differential Equations || [https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_ODEs Summary]||
|-
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| Adversarial Attacks on Copyright Detection Systems || Paper [https://proceedings.icml.cc/static/paper_files/icml/2020/1894-Paper.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems Summary] || [https://www.youtube.com/watch?v=bQI9S6bCo8o]
|-
|Week of Nov 16 || Casey De Vera, Solaiman Jawad || 7|| IPBoost – Non-Convex Boosting via Integer Programming || [https://arxiv.org/pdf/2002.04679.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=IPBoost Summary] || [https://www.youtube.com/watch?v=4DhJDGC4pyI&feature=youtu.be]
|-
|Week of Nov 16 || Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran || 8|| What Game Are We Playing? End-to-end Learning in Normal and Extensive Form Games || [https://arxiv.org/pdf/1805.02777.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=what_game_are_we_playing Summary] || [https://www.youtube.com/watch?v=9qJoVxo3hnI&feature=youtu.be]
|-
|Week of Nov 16 || Yuchuan Wu || 9|| || || ||
|-
|Week of Nov 16 || Zhou Zeping, Siqi Li, Yuqin Fang, Fu Rao || 10|| A survey of neural network-based cancer prediction models from microarray data || [https://www.sciencedirect.com/science/article/pii/S0933365717305067 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Y93fang Summary] || [https://youtu.be/B8pPUU8ypO0]
|-
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou Summary] || [https://www.youtube.com/watch?v=tvCEvvy54X8&ab_channel=JJLian]
|-
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Networks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat Summary] || [https://www.youtube.com/watch?v=or5RTxDnZDo]
|-
|Week of Nov 23 || Taohao Wang, Zeren Shen, Zihao Guo, Rui Chen || 13|| Large Scale Landmark Recognition via Deep Metric Learning || [https://arxiv.org/pdf/1908.10192.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang Summary] || [https://www.youtube.com/watch?v=K9NypDNPLJo Video]
|-
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data Summary] || [https://youtu.be/i_5PQdfuH-Y]
|-
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| Semantic Relation Classification via Convolution Neural Network|| [https://www.aclweb.org/anthology/S18-1127.pdf paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification——via_Convolution_Neural_Network Summary]|| [https://www.youtube.com/watch?v=m9o3NuMUKkU&ab_channel=DiMa video]
|-
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| Graph Structure of Neural Networks || [https://proceedings.icml.cc/paper/2020/file/757b505cfd34c64c85ca5b5690ee5293-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_Networks Summary] || [https://youtu.be/x9eUgEwntcs Video]
|-
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker Summary]|| [https://www.youtube.com/watch?v=kazqcOwbtTI Video]
|-
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Point-of-Interest_Recommendation:_Exploiting_Self-Attentive_Autoencoders_with_Neighbor-Aware_Influence Summary] || [https://www.youtube.com/watch?v=aAwjaos_Hus Video]
|-
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN Summary]|| [https://youtu.be/vOENmt9jgVE Video]
|-
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan Summary]|| [https://www.youtube.com/watch?v=48xYZXifjS0&ab_channel=SeakraChin]
|-
|Week of Nov 30 || Banno Dion, Battista Joseph, Kahn Solomon || 21|| Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks || [https://www.sciencedirect.com/science/article/pii/S1877050919310646] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Music_Recommender_System_Based_using_CRNN#Evaluation_of_Music_Recommendation_System: Summary] || [https://youtu.be/eGUV3zwLwqQ]
|-
|Week of Nov 30 || Isaac Ellmen, Dorsa Mohammadrezaei, Emilee Carson || 22|| A universal SNP and small-indel variant caller using deep neural networks||[https://www.nature.com/articles/nbt.4235.epdf?author_access_token=q4ZmzqvvcGBqTuKyKgYrQ9RgN0jAjWel9jnR3ZoTv0NuM3saQzpZk8yexjfPUhdFj4zyaA4Yvq0LWBoCYQ4B9vqPuv8e2HHy4vShDgEs8YxI_hLs9ov6Y1f_4fyS7kGZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_universal_SNP_and_small-indel_variant_caller_using_deep_neural_networks Summary] ||
|-
|Week of Nov 30 || Daniel Fagan, Cooper Brooke, Maya Perelman || 23|| Efficient kNN Classification With Different Number of Nearest Neighbors || [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7898482 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Dfagan Summary]|| [https://youtu.be/_STVyvm_Kao]
|-
|Week of Nov 30 || Karam Abuaisha, Evan Li, Jason Pu, Nicholas Vadivelu || 24|| Being Bayesian about Categorical Probability || [https://proceedings.icml.cc/static/paper_files/icml/2020/3560-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_Probability Summary] || [https://drive.google.com/file/d/1I0uYF2xEMuNVtaEhPb_vZ6bxSKMi0gUh/view?usp=sharing]
|-
|Week of Nov 30 || Anas Mahdi Will Thibault Jan Lau Jiwon Yang || 25|| Loss Function Search for Face Recognition
|| [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf] paper || Summary [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Loss_Function_Search_for_Face_Recognition] || [https://youtu.be/i3dXnK9HGSQ]
|-
|Week of Nov 30 ||Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee || 26|| Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms || [https://arxiv.org/pdf/1912.07618.pdf?fbclid=IwAR0RwATSn4CiT3qD9LuywYAbJVw8YB3nbex8Kl19OCExIa4jzWaUut3oVB0 Paper] || Summary [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&fbclid=IwAR1Tad2DAM7LT6NXXuSYDZtHHBvN0mjZtDdCOiUFFq_XwVcQxG3hU-3XcaE] || [https://www.youtube.com/watch?v=kiYbAmd_3IA]
|-
|Week of Nov 30 || Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim || 27|| Improving neural networks by preventing co-adaption of feature detectors || [https://arxiv.org/pdf/1207.0580.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectors Summary] || [https://youtu.be/SV5UOM3QwiA Video]
|-
|Week of Nov 30 || Yawen Wang, Danmeng Cui, ZiJie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Describtion_of_Text_Mining Summary] || [https://youtu.be/e9uRZQtzOLY]
|-
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN Summary] || [https://youtu.be/NgcSMXNDNuU]
|-
|Week of Nov 30 || Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classifcation systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Research_Papers_Classification_System Summary] || [https://youtu.be/Ug-5H4B5xkQ]
|-
|Week of Nov 30 || Daniel Zhang, Jacky Yao, Scholar Sun, Russell Parco, Ian Cheung || 31 || Speech2Face: Learning the Face Behind a Voice || [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Speech2Face:_Learning_the_Face_Behind_a_Voice Summary] || [https://youtu.be/lNQAbMxOj4w]
|-
|Week of Nov 30 || Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du || 32 || Evaluating Machine Accuracy on ImageNet || [https://proceedings.icml.cc/static/paper_files/icml/2020/6173-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Evaluating_Machine_Accuracy_on_ImageNet Summary] || [https://youtu.be/jj4S3VGzQz4 Video]
|-
|Week of Nov 30 || Mushi Wang, Siyuan Qiu, Yan Yu || 33 || Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections || [https://ieeexplore.ieee.org/abstract/document/8957421 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction Summary] || [https://youtu.be/cqyn3aO_5tc Video 33]

stat441F21

2020-11-30T17:22:12Z

D5cui:

== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==



= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=

Use the following notations:

P: You have written a summary/critique of the paper.

T: You had a technical contribution on a paper (excluding the paper that you present).

E: You had an editorial contribution on a paper (excluding the paper that you present).

=Paper presentation=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="250pt"|Name
|width="15pt"|Paper number
|width="700pt"|Title
|width="15pt"|Link to the paper
|width="30pt"|Link to the summary
|width="30pt"|Link to the video
|-
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]
|-
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||
[https://www.youtube.com/watch?v=TVLpSFYgF0c&feature=youtu.be]
|-
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Influenza_Forecasting_Framework_based_on_Gaussian_Processes Summary]|| [https://www.youtube.com/watch?v=HZG9RAHhpXc&feature=youtu.be]
|-
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin Summary] || [https://learn.uwaterloo.ca/d2l/ext/rp/577051/lti/framedlaunch/6ec1ebea-5547-46a2-9e4f-e3dc9d79fd54]
|-
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Streaming_Bayesian_Inference_for_Crowdsourced_Classification Summary] || [https://www.youtube.com/watch?v=UgVRzi9Ubws]
|-
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| Neural Ordinary Differential Equations || [https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_ODEs Summary]||
|-
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| Adversarial Attacks on Copyright Detection Systems || Paper [https://proceedings.icml.cc/static/paper_files/icml/2020/1894-Paper.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems Summary] || [https://www.youtube.com/watch?v=bQI9S6bCo8o]
|-
|Week of Nov 16 || Casey De Vera, Solaiman Jawad || 7|| IPBoost – Non-Convex Boosting via Integer Programming || [https://arxiv.org/pdf/2002.04679.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=IPBoost Summary] || [https://www.youtube.com/watch?v=4DhJDGC4pyI&feature=youtu.be]
|-
|Week of Nov 16 || Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran || 8|| What Game Are We Playing? End-to-end Learning in Normal and Extensive Form Games || [https://arxiv.org/pdf/1805.02777.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=what_game_are_we_playing Summary] || [https://www.youtube.com/watch?v=9qJoVxo3hnI&feature=youtu.be]
|-
|Week of Nov 16 || Yuchuan Wu || 9|| || || ||
|-
|Week of Nov 16 || Zhou Zeping, Siqi Li, Yuqin Fang, Fu Rao || 10|| A survey of neural network-based cancer prediction models from microarray data || [https://www.sciencedirect.com/science/article/pii/S0933365717305067 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Y93fang Summary] || [https://youtu.be/B8pPUU8ypO0]
|-
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou Summary] || [https://www.youtube.com/watch?v=tvCEvvy54X8&ab_channel=JJLian]
|-
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Networks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat Summary] || [https://www.youtube.com/watch?v=or5RTxDnZDo]
|-
|Week of Nov 23 || Taohao Wang, Zeren Shen, Zihao Guo, Rui Chen || 13|| Large Scale Landmark Recognition via Deep Metric Learning || [https://arxiv.org/pdf/1908.10192.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang Summary] || [https://www.youtube.com/watch?v=K9NypDNPLJo Video]
|-
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data Summary] || [https://youtu.be/i_5PQdfuH-Y]
|-
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| Semantic Relation Classification via Convolution Neural Network|| [https://www.aclweb.org/anthology/S18-1127.pdf paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification——via_Convolution_Neural_Network Summary]|| [https://www.youtube.com/watch?v=m9o3NuMUKkU&ab_channel=DiMa video]
|-
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| Graph Structure of Neural Networks || [https://proceedings.icml.cc/paper/2020/file/757b505cfd34c64c85ca5b5690ee5293-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_Networks Summary] || [https://youtu.be/x9eUgEwntcs Video]
|-
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker Summary]|| [https://www.youtube.com/watch?v=kazqcOwbtTI Video]
|-
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Point-of-Interest_Recommendation:_Exploiting_Self-Attentive_Autoencoders_with_Neighbor-Aware_Influence Summary] || [https://www.youtube.com/watch?v=aAwjaos_Hus Video]
|-
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN Summary]|| [https://youtu.be/vOENmt9jgVE Video]
|-
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan Summary]|| [https://www.youtube.com/watch?v=48xYZXifjS0&ab_channel=SeakraChin]
|-
|Week of Nov 30 || Banno Dion, Battista Joseph, Kahn Solomon || 21|| Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks || [https://www.sciencedirect.com/science/article/pii/S1877050919310646] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Music_Recommender_System_Based_using_CRNN#Evaluation_of_Music_Recommendation_System: Summary] || [https://youtu.be/eGUV3zwLwqQ]
|-
|Week of Nov 30 || Isaac Ellmen, Dorsa Mohammadrezaei, Emilee Carson || 22|| A universal SNP and small-indel variant caller using deep neural networks||[https://www.nature.com/articles/nbt.4235.epdf?author_access_token=q4ZmzqvvcGBqTuKyKgYrQ9RgN0jAjWel9jnR3ZoTv0NuM3saQzpZk8yexjfPUhdFj4zyaA4Yvq0LWBoCYQ4B9vqPuv8e2HHy4vShDgEs8YxI_hLs9ov6Y1f_4fyS7kGZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_universal_SNP_and_small-indel_variant_caller_using_deep_neural_networks Summary] ||
|-
|Week of Nov 30 || Daniel Fagan, Cooper Brooke, Maya Perelman || 23|| Efficient kNN Classification With Different Number of Nearest Neighbors || [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7898482 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Dfagan Summary]|| [https://youtu.be/_STVyvm_Kao]
|-
|Week of Nov 30 || Karam Abuaisha, Evan Li, Jason Pu, Nicholas Vadivelu || 24|| Being Bayesian about Categorical Probability || [https://proceedings.icml.cc/static/paper_files/icml/2020/3560-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_Probability Summary] || [https://drive.google.com/file/d/1I0uYF2xEMuNVtaEhPb_vZ6bxSKMi0gUh/view?usp=sharing]
|-
|Week of Nov 30 || Anas Mahdi Will Thibault Jan Lau Jiwon Yang || 25|| Loss Function Search for Face Recognition
|| [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf] paper || Summary [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Loss_Function_Search_for_Face_Recognition] || [https://youtu.be/i3dXnK9HGSQ]
|-
|Week of Nov 30 ||Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee || 26|| Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms || [https://arxiv.org/pdf/1912.07618.pdf?fbclid=IwAR0RwATSn4CiT3qD9LuywYAbJVw8YB3nbex8Kl19OCExIa4jzWaUut3oVB0 Paper] || Summary [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&fbclid=IwAR1Tad2DAM7LT6NXXuSYDZtHHBvN0mjZtDdCOiUFFq_XwVcQxG3hU-3XcaE] || [https://www.youtube.com/watch?v=kiYbAmd_3IA]
|-
|Week of Nov 30 || Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim || 27|| Improving neural networks by preventing co-adaption of feature detectors || [https://arxiv.org/pdf/1207.0580.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectors Summary] || [https://youtu.be/SV5UOM3QwiA Video]
|-
|Week of Nov 30 || Yawen Wang, Danmeng Cui, ZiJie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Describtion_of_Text_Mining Summary] ||
|-
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN Summary] || [https://youtu.be/NgcSMXNDNuU]
|-
|Week of Nov 30 || Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classifcation systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Research_Papers_Classification_System Summary] || [https://youtu.be/Ug-5H4B5xkQ]
|-
|Week of Nov 30 || Daniel Zhang, Jacky Yao, Scholar Sun, Russell Parco, Ian Cheung || 31 || Speech2Face: Learning the Face Behind a Voice || [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Speech2Face:_Learning_the_Face_Behind_a_Voice Summary] || [https://youtu.be/lNQAbMxOj4w]
|-
|Week of Nov 30 || Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du || 32 || Evaluating Machine Accuracy on ImageNet || [https://proceedings.icml.cc/static/paper_files/icml/2020/6173-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Evaluating_Machine_Accuracy_on_ImageNet Summary] ||
|-
|Week of Nov 30 || Mushi Wang, Siyuan Qiu, Yan Yu || 33 || Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections || [https://ieeexplore.ieee.org/abstract/document/8957421 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction Summary] ||

User:Cvmustat

2020-11-30T15:40:20Z

D5cui: /* Critiques */

== Combine Convolution with Recurrent Networks for Text Classification ==
'''Team Members''': Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea

'''Date''': Week of Nov 23

== Introduction ==

Text classification is the task of assigning a set of predefined categories to natural language texts. It involves learning an embedding layer which allows context-dependent classification. It is a fundamental task in Natural Language Processing (NLP) with various applications such as sentiment analysis, and topic classification. A classic example involving text classification is given a set of News articles, is it possible to classify the genre or subject of each article? Text classification is useful as text data is a rich source of information, but extracting insights from it directly can be difficult and time-consuming as most text data is unstructured.[1] NLP text classification can help automatically structure and analyze text quickly and cost-effectively, allowing for individuals to extract important features from the text easier than before.

Text classification work mainly focuses on three topics: feature engineering, feature selection, and the use of different types of machine learning algorithms.
:1. Feature engineering, the most widely used feature is the bag of words feature. Some more complex functions are also designed, such as part-of-speech tags, noun phrases, and tree kernels.
:2. Feature selection aims to remove noisy features and improve classification performance. The most common feature selection method is to delete stop words.
:3. Machine learning algorithms usually use classifiers, such as Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM).

In practice, pre-trained word embeddings and deep neural networks are used together for NLP text classification. Word embeddings are used to map the raw text data to an implicit space where the semantic relationships of the words are preserved; words with similar meaning have a similar representation. One can then feed these embeddings into deep neural networks to learn different features of the text. Convolutional neural networks can be used to determine the semantic composition of the text(the meaning), as it treats texts as a 2D matrix by concatenating the embedding of words together. It uses a 1D convolution operator to perform the feature mapping and then conducts a 1D pooling operation over the time domain for obtaining a fixed-length output feature vector, and it can capture both local and position invariant features of the text.[2] Alternatively, Recurrent Neural Networks can be used to determine the contextual meaning of each word in the text (how each word relates to one another) by treating the text as sequential data and then analyzing each word separately. [3] Previous approaches to attempt to combine these two neural networks to incorporate the advantages of both models involve streamlining the two networks, which might decrease their performance. Besides, most methods incorporating a bi-directional Recurrent Neural Network usually concatenate the forward and backward hidden states at each time step, which results in a vector that does not have the interaction information between the forward and backward hidden states.[4] The hidden state in one direction contains only the contextual meaning in that particular direction, however a word's contextual representation, intuitively, is more accurate when collected and viewed from both directions. This paper argues that the failure to observe the meaning of a word in both directions causes the loss of the true meaning of the word, especially for polysemic words (words with more than one meaning) that are context-sensitive.

== Paper Key Contributions ==

This paper suggests an enhanced method of text classification by proposing a new way of combining Convolutional and Recurrent Neural Networks (CRNN) involving the addition of a neural tensor layer. The proposed method maintains each network's respective strengths that are normally lost in previous combination methods. The new suggested architecture is called CRNN, which utilizes both a CNN and RNN that run in parallel on the same input sentence. CNN uses weight matrix learning and produces a 2D matrix that shows the importance of each word based on local and position-invariant features. The bidirectional RNN produces a matrix that learns each word's contextual representation; the words' importance about to with concerning the rest of the sentence. A neural tensor layer is introduced on top of the RNN to obtain the fusion of bi-directional contextual information surrounding a particular word. This method combines these two matrix representations and classifies the text, providing the important information of each word for prediction, which helps to explain the results. The model also uses dropout and L2 regularization to prevent overfitting.

== CRNN Results vs Benchmarks ==

In order to benchmark the performance of the CRNN model, as well as compare it to other previous efforts, multiple datasets and classification problems were used. All of these datasets are publicly available and can be easily downloaded by any user for testing.

- '''Movie Reviews:''' a sentiment analysis dataset, with two classes (positive and negative).

- '''Yelp:''' a sentiment analysis dataset, with five classes. For this test, a subset of 120,000 reviews was randomly chosen from each class for a total of 600,000 reviews.

- '''AG's News:''' a news categorization dataset, using only the 4 largest classes from the dataset. There are 30000 training samples and 1900 test samples.

- '''20 Newsgroups:''' a news categorization dataset, again using only 4 large classes (comp, politics, rec, and religion) from the dataset.

- '''Sogou News:''' a Chinese news categorization dataset, using 10 major classes as a multi-class classification and include 6500 samples randomly from each class.

- '''Yahoo! Answers:''' a topic classification dataset, with 10 classes and each class contains 140000 training samples and 5000 testing samples.

For the English language datasets, the initial word representations were created using the publicly available ''word2vec'' [https://code.google.com/p/word2vec/] from Google news. For the Chinese language dataset, ''jieba'' [https://github.com/fxsjy/jieba] was used to segment sentences, and then 50-dimensional word vectors were trained on Chinese ''wikipedia'' using ''word2vec''.

A number of other models are run against the same data after preprocessing. Some of these models include:

- '''Self-attentive LSTM:''' an LSTM model with multi-hop attention for sentence embedding.

- '''RCNN:''' the RCNN's recurrent structure allows for increased depth of capture for contextual information. Less noise is introduced on account of the model's holistic structure (compared to local features).

The following results are obtained:

[[File:table of results.png|550px|center]]

The bold results represent the best performing model for a given dataset. These results show that the CRNN model is the best model for 4 of the 6 datasets, with the Self-attentive LSTM beating the CRNN by 0.03 and 0.12 on the news categorization problems. Considering that the CRNN model has better performance than the Self-attentive LSTM on the other 4 datasets, this suggests that the CRNN model is a better performer overall in the conditions of this benchmark.

It should be noted that including the neural tensor layer in the CRNN model leads to a significant performance boost compared to the CRNN models without it. The performance boost can be attributed to the fact that the neural tensor layer captures the surrounding contextual information for each word, and brings this information between the forward and backward RNN in a direct method. As seen in the table, this leads to a better classification accuracy across all datasets.

Another important result was that the CRNN model filter size impacted performance only in the sentiment analysis datasets, as seen in the following table, where the results for the AG's news, Yelp, and Yahoo! Answer datasets are displayed:

[[File:filter_effects.png|550px|center]]

== CRNN Model Architecture ==

The CRNN model is a combination of RNN and CNN. It uses a CNN to compute the importance of each word in the text and utilizes a neural tensor layer to fuse forward and backward hidden states of bi-directional RNN.

The input of the network is a text, which is a sequence of words. The output of the network is the text representation that is subsequently used as input of a fully-connected layer to obtain the class prediction.

'''RNN Pipeline:'''

The goal of the RNN pipeline is to input each word in a text, and retrieve the contextual information surrounding the word and compute the contextual representation of the word itself. This is accomplished by the use of a bi-directional RNN, such that a Neural Tensor Layer (NTL) can combine the results of the RNN to obtain the final output. RNNs are well-suited to NLP tasks because of their ability to sequentially process data such as ordered text.

A RNN is similar to a feed-forward neural network, but it relies on the use of hidden states. Hidden states are layers in the neural net that produce two outputs: <math> \hat{y}_{t} </math> and <math> h_t </math>. For a time step <math> t </math>, <math> h_t </math> is fed back into the layer to compute <math> \hat{y}_{t+1} </math> and <math> h_{t+1} </math>.

The pipeline will actually use a variant of RNN called GRU, short for Gated Recurrent Units. This is done to address the vanishing gradient problem which causes the network to struggle to memorize words that came earlier in the sequence. Traditional RNNs are only able to remember the most recent words in a sequence, which may be problematic since words that came at the beginning of the sequence that is important to the classification problem may be forgotten. A GRU attempts to solve this by controlling the flow of information through the network using update and reset gates.

Let <math>h_{t-1} \in \mathbb{R}^m, x_t \in \mathbb{R}^d </math> be the inputs, and let <math>\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h \in \mathbb{R}^{m \times d}, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h \in \mathbb{R}^{m \times m}</math> be trainable weight matrices. Then the following equations describe the update and reset gates:

<math>
z_t = \sigma(\mathbf{W}_zx_t + \mathbf{U}_zh_{t-1}) \text{update gate} \\
r_t = \sigma(\mathbf{W}_rx_t + \mathbf{U}_rh_{t-1}) \text{reset gate} \\
\tilde{h}_t = \text{tanh}(\mathbf{W}_hx_t + r_t \circ \mathbf{U}_hh_{t-1}) \text{new memory} \\
h_t = (1-z_t)\circ \tilde{h}_t + z_t\circ h_{t-1}
</math>

Note that <math> \sigma, \text{tanh}, \circ </math> are all element-wise functions. The above equations do the following:

<ol>
<li> <math>h_{t-1}</math> carries information from the previous iteration and <math>x_t</math> is the current input </li>
<li> the update gate <math>z_t</math> controls how much past information should be forwarded to the next hidden state </li>
<li> the rest gate <math>r_t</math> controls how much past information is forgotten or reset </li>
<li> new memory <math>\tilde{h}_t</math> contains the relevant past memory as instructed by <math>r_t</math> and current information from the input <math>x_t</math> </li>
<li> then <math>z_t</math> is used to control what is passed on from <math>h_{t-1}</math> and <math>(1-z_t)</math> controls the new memory that is passed on
</ol>

We treat <math>h_0</math> and <math> h_{n+1} </math> as zero vectors in the method. Thus, each <math>h_t</math> can be computed as above to yield results for the bi-directional RNN. The resulting hidden states <math>\overrightarrow{h_t}</math> and <math>\overleftarrow{h_t}</math> contain contextual information around the <math> t</math>-th word in forward and backward directions respectively. Contrary to convention, instead of concatenating these two vectors, it is argued that the word's contextual representation is more precise when the context information from different directions is collected and fused using a neural tensor layer as it permits greater interactions among each element of hidden states. Using these two vectors as input to the neural tensor layer, <math>V^i </math>, we compute a new representation that aggregates meanings from the forward and backward hidden states more accurately as follows:

<math>
[\hat{h_t}]_i = tanh(\overrightarrow{h_t}V^i\overleftarrow{h_t} + b_i)
</math>

Where <math>V^i \in \mathbb{R}^{m \times m} </math> is the learned tensor layer, and <math> b_i \in \mathbb{R} </math> is the bias.We repeat this <math> m </math> times with different <math>V^i </math> matrices and <math> b_i </math> vectors. Through the neural tensor layer, each element in <math> [\hat{h_t}]_i </math> can be viewed as a different type of intersection between the forward and backward hidden states. In the model, <math> [\hat{h_t}]_i </math> will have the same size as the forward and backward hidden states. We then concatenate the three hidden states vectors to form a new vector that summarizes the context information :
<math>
\overleftrightarrow{h_t} = [\overrightarrow{h_t}^T,\overleftarrow{h_t}^T,\hat{h_t}]^T
</math>

We calculate this vector for every word in the text and then stack them all into matrix <math> H </math> with shape <math>n</math>-by-<math>3m</math>.

<math>
H = [\overleftrightarrow{h_1};...\overleftrightarrow{h_n}]
</math>

This <math>H</math> matrix is then forwarded as the results from the Recurrent Neural Network.

'''CNN Pipeline:'''

The goal of the CNN pipeline is to learn the relative importance of words in an input sequence based on different aspects. The process of this CNN pipeline is summarized as the following steps:

<ol>
<li> Given a sequence of words, each word is converted into a word vector using the word2vec algorithm which gives matrix X.
</li>

<li> Word vectors are then convolved through the temporal dimension with filters of various sizes (ie. different K) with learnable weights to capture various numerical K-gram representations. These K-gram representations are stored in matrix C.
</li>

<ul>
<li> The convolution makes this process capture local and position-invariant features. Local means the K words are contiguous. Position-invariant means K contiguous words at any position are detected in this case via convolution.

<li> Temporal dimension example: convolve words from 1 to K, then convolve words 2 to K+1, etc
</li>
</ul>

<li> Since not all K-gram representations are equally meaningful, there is a learnable matrix W which takes the linear combination of K-gram representations to more heavily weigh the more important K-gram representations for the classification task.
</li>

<li> Each linear combination of the K-gram representations gives the relative word importance based on the aspect that the linear combination encodes.
</li>

<li> The relative word importance vs aspect gives rise to an interpretable attention matrix A, where each element says the relative importance of a specific word for a specific aspect.
</li>

</ol>

[[File:Group12_Figure1.png |center]]

<div align="center">Figure 1: The architecture of CRNN.</div>

== Merging RNN & CNN Pipeline Outputs ==

The results from both the RNN and CNN pipeline can be merged by simply multiplying the output matrices. That is, we compute <math>S=A^TH</math> which has shape <math>z \times 3m</math> and is essentially a linear combination of the hidden states. The concatenated rows of S results in a vector in <math>\mathbb{R}^{3zm}</math> and can be passed to a fully connected Softmax layer to output a vector of probabilities for our classification task.

To train the model, we make the following decisions:
<ul>
<li> Use cross-entropy loss as the loss function (A cross-entropy loss function usually takes in two distributions, a true distribution p and an estimated distribution q, and measures the average number of bits need to identify an event. This calculation is independent of the kind of layers used in the network as well as the kind of activation being implemented.) </li>
<li> Perform dropout on random columns in matrix C in the CNN pipeline </li>
<li> Perform L2 regularization on all parameters </li>
<li> Use stochastic gradient descent with a learning rate of 0.001 </li>
</ul>

== Interpreting Learned CRNN Weights ==

Recall that attention matrix A essentially stores the relative importance of every word in the input sequence for every aspect chosen. Naturally, this means that A is an n-by-z matrix, with n being the number of words in the input sequence and z being the number of aspects considered in the classification task.

Furthermore, for any specific aspect, words with higher attention values are more important relative to other words in the same input sequence. likewise, for any specific word, aspects with higher attention values prioritize the specific word more than other aspects.

For example, in this paper, a sentence is sampled from the Movie Reviews dataset, and the transpose of attention matrix A is visualized. Each word represents an element in matrix A, the intensity of red represents the magnitude of an attention value in A, and each sentence is the relative importance of each word for a specific context. In the first row, the words are weighted in terms of a positive aspect, in the last row, the words are weighted in terms of a negative aspect, and in the middle row, the words are weighted in terms of a positive and negative aspect. Notice how the relative importance of words is a function of the aspect.

[[File:Interpretation example.png|800px|center]]

From the above sample, it is interesting that the word "but" is viewed as a negative aspect. From a linguistic perspective, the semantic of "but" is incredibly difficult to capture because of the degree of contextual information it needs. In this case, "but" is in the middle of a transition from a negative to a positive so the first row should also have given attention to that word. Also, it seems that the model has learned to give very high attention to the two words directly adjacent to the word of high attention: "is" and "and" beside "powerful", and "an" and "cast" beside "unwieldy".

The paper also shows that the model can determine important words in the news. The authors take Ag's news dataset and randomly select 10 important words for each class. The table (shown below) contains eye-catching words which fit their classes well.

[[File:news_words.png|480px|center]]

== Conclusion & Summary ==

This paper proposed a new architecture, the Convolutional Recurrent Neural Network, for text classification. The Convolutional Neural Network is used to learn the relative importance of each word from their different aspects and stores this information into a weight matrix. The Recurrent Neural Network learns each word's contextual representation through the combination of the forward and backward context information that is fused using a neural tensor layer and is stored as a matrix. These two matrices are then combined to get the text representation used for classification. Although the specifics of the performed tests are lacking, the experiment's results indicate that the proposed method performed well in comparison to most previous methods. In addition to performing well, the proposed method also provides insight into which words contribute greatly to the classification decision as to the learned matrix from the Convolutional Neural Network stores the relative importance of each word. This information can then be used in other applications or analyses. In the future, one can explore the features extracted from the model and use them to potentially learn new methods such as model space. [5]

== Critiques ==

1. In the '''Method''' section of the paper, some explanations used the same notation for multiple different elements of the model. This made the paper harder to follow and understand since they were referring to different elements by identical notation. Additionally, the decision to use sigmoid and hyperbolic tangent functions as nonlinearities for representation learning is not supported with evidence that these are optimal.

2. In '''Comparison of Methods''', the authors discuss the range of hyperparameter settings that they search through. While some of the hyperparameters have a large range of search values, three parameters are fixed without much explanation as to why for all experiments, size of the hidden state of GRU, number of layers, and dropout. These parameters have a lot to do with the complexity of the model and this paper could be improved by providing relevant reasoning behind these values, or by providing additional experimental results over different values of these parameters.

3. In the '''Results''' section of the paper, they tried to show that the classification results from the CRNN model can be better interpreted than other models. In these explanations, the details were lacking and the authors did not adequately demonstrate how their model is better than others.

4. Finally, in the '''Results''' section again, the paper compares the CRNN model to several models which they did not implement and reproduce results with. This can be seen in the chart of results above, where several models do not have entries in the table for all six datasets. Since the authors used a subset of the datasets, these other models which were not reproduced could have different accuracy scores if they had been tested on the same data as the CRNN model. This difference in training and testing data is not mentioned in the paper, and the conclusion that the CRNN model is better in all cases may not be valid.

5. Considering the general methodology, the author in the paper chose to fuse CNN with Gated recurrent unit (GRU), which is only one version of RNN. However, it has been shown that LSTM generally performs better than GRU, and the author should discuss their choice of using GRU in CRNN instead of LSTM in more detail. Looking at the authors' experimental results, LSTM even outperforms CRNN (with GRU) in some cases, which further motivates the idea of adopting LSTM for RNN to combine with CNN. Whether combining LSTM with CNN will lead to a better performance will of course need further verification, but in principle author should at least address the issue.

6. This is an interesting method, I would be curious to see if this can be combined or compared with Quasi-Recurrent Neural Networks (https://arxiv.org/abs/1611.01576). In my experience, QRNNs perform similarly to LSTMs while running significantly faster using convolutions with a special temporal pooling. This seems compatible with the neural tensor layer proposed in this paper, which may be combined to yield stronger performance with faster runtimes.

7. It would be interesting to see how the attention matrix is being constructed and how attention values are being determined in each matrix. For instance, does every different subject have its own attention matrix? If so, how will the situation be handled when the same attention matrix is used in different settings?

8. The paper shows the CRNN model not performing the best with Ag's news and 20newsgroups. It would be interesting to investigate this in detail and see the difference in the way the data is handled in the model compared to the best performing model(self-attentive LSTM in both datasets).

9. From the Interpreting Learned CRNN Weights part, the samples are labeled as positive and negative, and their words all have opposite emotional polarities. It can be observed that regardless of whether the polarity of the example is positive or negative, the keyword can be extracted by this method, reflecting that it can capture multiple semantically meaningful components. At the same time, it will be very interesting to see if this method is applicable to other specific categories.

10. The authors of this paper provide 2 examples of what topic classification is, but do not provide any explicit examples of "polysemic words whose meanings are context-sensitive", one of their main critiques of current methods. This is an opportunity to promote the use of their method and engage and inform the reader, simply by listing examples of these words.

11. In the "Interpreting Learned CRNN Weights" parts, authors gave a figure but did not explain whether this is an example of positive case or negative case. I suggest the author to label the figure. Also, in the paper, we can see there are 2 figures and both of the figures have been labelled as positive or negative but there is only one figure without labelled in the summary.

12. In the CRNN Results vs Benchmarks, it would be better to provide more details and comparison of other methods other than CRNN. It would be clear and more intuitive for the readers why the author will choose CRNN rather than the others.

== Comments ==

- Could this be applied to ancient languages such as hieroglyphs to decipher/better understand them?

- Another application for CRNN might be classifying spoken language as well as any form of audio.

-I think it will be better to show more results by using this method. Maybe it will be better to put the result part after the architecture part? Writing a motivation will be better since it will catch readers' "eyes". I think it will be interesting to ask: whether can we apply this to ancient Chinese poetry? Since there are lots of types of ancient Chinese poetry, doing a classification for them will be interesting.

- In another [https://www.aclweb.org/anthology/W99-0908/ paper] written by Andrew McCallum and Kamal Nigam, they introduce a different method of text classification. Namely, instead of a combination of recurrent and convolutional neural networks, they instead utilized bootstrapping with keywords, Expectation-Maximization algorithm, and shrinkage.

== References ==
----

[1] Grimes, Seth. “Unstructured Data and the 80 Percent Rule.” Breakthrough Analysis, 1 Aug. 2008, breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.

[2] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modeling sentences,”
arXiv preprint arXiv:1404.2188, 2014.

[3] K. Cho, B. V. Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning
phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.

[4] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Proceedings
of AAAI, 2015, pp. 2267–2273.

[5] H. Chen, P. Tio, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE
Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 124–136, 2014.

Neural ODEs

2020-11-29T17:18:31Z

D5cui: /* Conclusions and Critiques */

== Introduction ==
Chen et al. propose a new class of neural networks called neural ordinary differential equations (ODEs) in their 2018 paper under the same title. Neural network models, such as residual or recurrent networks, can be generalized as a set of transformations through hidden states (a.k.a layers) <math>\mathbf{h}</math>, given by the equation

<div style="text-align:center;"><math> \mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t,\theta_t) </math> (1) </div>

where <math>t \in \{0,...,T\}</math> and <math>\theta_t</math> corresponds to the set of parameters or weights in state <math>t</math>. It is important to note that it has been shown (Lu et al., 2017)(Haber
and Ruthotto, 2017)(Ruthotto and Haber, 2018) that Equation 1 can be viewed as an Euler discretization. Given this Euler description, if the number of layers and step size between layers are taken to their limits, then Equation 1 can instead be described continuously in the form of the ODE,

<div style="text-align:center;"><math> \frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t),t,\theta) </math> (2). </div>

Equation 2 now describes a network where the output layer <math>\mathbf{h}(T)</math> is generated by solving for the ODE at time <math>T</math>, given the initial value at <math>t=0</math>, where <math>\mathbf{h}(0)</math> is the input layer of the network.

With a vast amount of theory and research in the field of solving ODEs numerically, there are a number of benefits to formulating the hidden state dynamics this way. One major advantage is that a continuous description of the network allows for the calculation of <math>f</math> at arbitrary intervals and locations. The authors provide an example in section five of how the neural ODE network outperforms the discretized version i.e. residual networks, by taking advantage of the continuity of <math>f</math>. A depiction of this distinction is shown in the figure below.

<div style="text-align:center;"> [[File:NeuralODEs_Fig1.png|350px]] </div>

In section four the authors show that the single-unit bottleneck of normalizing flows can be overcome by constructing a new class of density models that incorporates the neural ODE network formulation.
The next section on automatic differentiation will describe how utilizing ODE solvers allows for the calculation of gradients of the loss function without storing any of the hidden state information. This results in a very low memory requirement for neural ODE networks in comparison to traditional networks that rely on intermediate hidden state quantities for backpropagation.

== Reverse-mode Automatic Differentiation of ODE Solutions ==
Like most neural networks, optimizing the weight parameters <math>\theta</math> for a neural ODE network involves finding the gradient of a loss function with respect to those parameters. Differentiating in the forward direction is a simple task, however, this method is very computationally expensive and unstable, as it introduces additional numerical error. Instead, the authors suggest that the gradients can be calculated in the reverse-mode with the adjoint sensitivity method (Pontryagin et al., 1962). This "backpropagation" method solves an augmented version of the forward ODE problem but in reverse, which is something that all ODE solvers are capable of. Section 3 provides results showing that this method gives very desirable memory costs and numerical stability.

The authors provide an example of the adjoint method by considering the minimization of the scalar-valued loss function <math>L</math>, which takes the solution of the ODE solver as its argument.

<div style="text-align:center;">[[File:NeuralODEs_Eq1.png|700px]],</div>
This minimization problem requires the calculation of <math>\frac{\partial L}{\partial \mathbf{z}(t_0)}</math> and <math>\frac{\partial L}{\partial \theta}</math>.

The adjoint itself is defined as <math>\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{z}(t)}</math>, which describes the gradient of the loss with respect to the hidden state <math>\mathbf{z}(t)</math>. By taking the first derivative of the adjoint, another ODE arises in the form of,

<div style="text-align:center;"><math>\frac{d \mathbf{a}(t)}{dt} = -\mathbf{a}(t)^T \frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \mathbf{z}}</math> (3).</div>

Since the value <math>\mathbf{a}(t_0)</math> is required to minimize the loss, the ODE in equation 3 must be solved backwards in time from <math>\mathbf{a}(t_1)</math>. Solving this problem is dependent on the knowledge of the hidden state <math>\mathbf{z}(t)</math> for all <math>t</math>, which an neural ODE does not save on the forward pass. Luckily, both <math>\mathbf{a}(t)</math> and <math>\mathbf{z}(t)</math> can be calculated in reverse, at the same time, by setting up an augmented version of the dynamics and is shown in the final algorithm. Finally, the derivative <math>dL/d\theta</math> can be expressed in terms of the adjoint and the hidden state as,

<div style="text-align:center;"><math> \frac{dL}{d\theta} -\int_{t_1}^{t_0} \mathbf{a}(t)^T\frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \theta}dt</math> (4).</div>

To obtain very inexpensive calculations of <math>\frac{\partial f}{\partial z}</math> and <math>\frac{\partial f}{\partial \theta}</math> in equation 3 and 4, automatic differentiation can be utilized. The authors present an algorithm to calculate the gradients of <math>L</math> and their dependent quantities with only one call to an ODE solver and is shown below.

<div style="text-align:center;">[[File:NeuralODEs Algorithm1.png|850px]]</div>

If the loss function has a stronger dependence on the hidden states for <math>t \neq t_0,t_1</math>, then Algorithm 1 can be modified to handle multiple calls to the ODESolve step since most ODE solvers have the capability to provide <math>z(t)</math> at arbitrary times. A visual depiction of this scenario is shown below.

<div style="text-align:center;">[[File:NeuralODES Fig2.png|350px]]</div>

Please see the [https://arxiv.org/pdf/1806.07366.pdf#page=13 appendix] for extended versions of Algorithm 1 and detailed derivations of each equation in this section.

== Replacing Residual Networks with ODEs for Supervised Learning ==
Section three of the paper investigates an application of the reverse-mode differentiation described in section two, for the training of neural ODE networks on the MNIST digit data set. To solve for the forward pass in the neural ODE network, the following experiment used Adams-Moulton (AM) method, which is an implicit ODE solver. Although it has a marked improvement over explicit ODE solvers in numerical accuracy, integrating backward through the network for backpropagation is still not preferred and the adjoint sensitivity method is used to perform efficient weight optimization. The network with this "backpropagation" technique is referred to as ODE-Net in this section.

=== Implementation ===
A residual network (ResNet), studied by He et al. (2016), with six standard residual blocks was used as a comparative model for this experiment. The competing model, ODE-net, replaces the residual blocks of the ResNet with the AM solver. As a hybrid of the two models ResNet and ODE-net, a third network was created called RK-Net, which solves the weight optimization of the neural ODE network explicitly through backward Runge-Kutta integration. The following table shows the training and performance results of each network.

<div style="text-align:center;">[[File:NeuralODEs Table1.png|400px]]</div>

Note that <math>L</math> and <math>\tilde{L}</math> are the number of layers in ResNet and the number of function calls that the AM method makes for the two ODE networks and are effectively analogous quantities. As shown in Table 1, both of the ODE networks achieve comparable performance to that of the ResNet with a notable decrease in memory cost for ODE-net.

Another interesting component of ODE networks is the ability to control the tolerance in the ODE solver used and subsequently the numerical error in the solution.

<div style="text-align:center;">[[File:NeuralODEs Fig3.png|700px]]</div>

The tolerance of the ODE solver is represented by the colour bar in Figure 3 above and notice that a variety of effects arise from adjusting this parameter. Primarily, if one was to treat the tolerance as a hyperparameter of sorts, you could tune it such that you find a balance between accuracy (Figure 3a) and computational complexity (Figure 3b). Figure 3c also provides further evidence for the benefits of the adjoint method for the backward pass in ODE-nets since there is a nearly 1:0.5 ratio of forward to backward function calls. In the ResNet and RK-Net examples, this ratio is 1:1.

Additionally, the authors loosely define the concept of depth in a neural ODE network by referring to Figure 3d. Here it's evident that as you continue to train an ODE network, the number of function evaluations the ODE solver performs increases. As previously mentioned, this quantity is comparable to the network depth of a discretized network. However, as the authors note, this result should be seen as the progression of the network's complexity over training epochs, which is something we expect to increase over time.

== Continuous Normalizing Flows ==

Section four tackles the implementation of continuous-depth Neural Networks, but to do so, in the first part of section four the authors discuss theoretically how to establish this kind of network through the use of normalizing flows. The authors use a change of variables method presented in other works (Rezende and Mohamed, 2015), (Dinh et al., 2014), to compute the change of a probability distribution if sample points are transformed through a bijective function, <math>f</math>.

<div style="text-align:center;"><math>z_1=f(z_0) \Rightarrow \log⁡(p(z_1))=\log⁡(p(z_0))-\log⁡|\det⁡\frac{\partial f}{\partial z_0}|</math></div>

Where p(z) is the probability distribution of the samples and <math>det⁡\frac{\partial f}{\partial z_0}</math> is the determinant of the Jacobian which has a cubic cost in the dimension of '''z''' or the number of hidden units in the network. The authors discovered however that transforming the discrete set of hidden layers in the normalizing flow network to continuous transformations simplifies the computations significantly, due primarily to the following theorem:

'''''Theorem 1:''' (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability p(z(t)) dependent on time. Let dz/dt=f(z(t),t) be a differential equation describing a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the change in log probability also follows a differential equation:''

<div style="text-align:center;"><math>\frac{\partial \log(p(z(t)))}{\partial t}=-tr\left(\frac{df}{dz(t)}\right)</math></div>

The biggest advantage to using this theorem is that the trace function is a linear function, so if the dynamics of the problem, f, is represented by a sum of functions, then so is the log density. This essentially means that you can now compute flow models with only a linear cost with respect to the number of hidden units, <math>M</math>. In standard normalizing flow models, the cost is <math>O(M^3)</math>, so they will generally fit many layers with a single hidden unit in each layer.

Finally the authors use these realizations to construct Continuous Normalizing Flow networks (CNFs) by specifying the parameters of the flow as a function of ''t'', ie, <math>f(z(t),t)</math>. They also use a gating mechanism for each hidden unit, <math>\frac{dz}{dt}=\sum_n \sigma_n(t)f_n(z)</math> where <math>\sigma_n(t)\in (0,1)</math> is a separate neural network which learns when to apply each dynamic <math>f_n</math>.

===Implementation===

The authors construct two separate types of neural networks to compare against each other, the first is the standard planar Normalizing Flow network (NF) using 64 layers of single hidden units, and the second is their new CNF with 64 hidden units. The NF model is trained over 500,000 iterations using RMSprop, and the CNF network is trained over 10,000 iterations using Adam(algorithm for first-order gradient-based optimization of stochastic objective functions). The loss function is <math>KL(q(x)||p(x))</math> where <math>q(x)</math> is the flow model and <math>p(x)</math> is the target probability density.

One of the biggest advantages when implementing CNF is that you can train the flow parameters just by performing maximum likelihood estimation on <math>\log(q(x))</math> given <math>p(x)</math>, where <math>q(x)</math> is found via the theorem above, and then reversing the CNF to generate random samples from <math>q(x)</math>. This reversal of the CNF is done with about the same cost of the forward pass which is not able to be done in an NF network. The following two figures demonstrate the ability of CNF to generate more expressive and accurate output data as compared to standard NF networks.

<div style="text-align:center;">
[[Image:CNFcomparisons.png]]

[[Image:CNFtransitions.png]]
</div>

Figure 4 shows clearly that the CNF structure exhibits significantly lower loss functions than NF. In figure 5 both networks were tasked with transforming a standard Gaussian distribution into a target distribution, not only was the CNF network more accurate on the two moons target, but also the steps it took along the way are much more intuitive than the output from NF.

== A Generative Latent Function Time-Series Model ==

One of the largest issues at play in terms of Neural ODE networks is the fact that in many instances, data points are either very sparsely distributed, or irregularly-sampled. The latent dynamics are discretized and the observations are in the bins of fixed duration. This creates issues with missing data and ill-defined latent variables. An example of this is medical records which are only updated when a patient visits a doctor or the hospital. To solve this issue the authors had to create a generative time-series model which would be able to fill in the gaps of missing data. The authors consider each time series as a latent trajectory stemming from the initial local state <math>z_{t_0 }</math> and determined from a global set of latent parameters. Given a set of observation times and initial state, the generative model constructs points via the following sample procedure:

<div style="text-align:center;">
<math>
z_{t_0}∼p(z_{t_0})
</math>
</div>

<div style="text-align:center;">
<math>
z_{t_1},z_{t_2},\dots,z_{t_N}={\rm ODESolve}(z_{t_0},f,θ_f,t_0,...,t_N)
</math>
</div>

<div style="text-align:center;">
each
<math>
x_{t_i}∼p(x│z_{t_i},θ_x)
</math>
</div>

<math>f</math> is a function which outputs the gradient <math>\frac{\partial z(t)}{\partial t}=f(z(t),θ_f)</math> which is parameterized via a neural net. In order to train this latent variable model, the authors had to first encode their given data and observation times using an RNN encoder, construct the new points using the trained parameters, then decode the points back into the original space. The following figure describes this process:

<div style="text-align:center;">
[[Image:EncodingFigure.png]]
</div>

Another variable which could affect the latent state of a time-series model is how often an event actually occurs. The authors solved this by parameterizing the rate of events in terms of a Poisson process. They described the set of independent observation times in an interval <math>\left[t_{start},t_{end}\right]</math> as:

<div style="text-align:center;">
<math>
{\rm log}⁡(p(t_1,t_2,\dots,t_N ))=\sum_{i=1}^N{\rm log}⁡(\lambda(z(t_i)))-\int_{t_{start}}^{t_{end}}λ(z(t))dt
</math>
</div>

where <math>\lambda(*)</math> is parameterized via another neural network.

===Implementation===

To test the effectiveness of the Latent time-series ODE model (LODE), they fit the encoder with 25 hidden units, parametrize function f with a one-layer 20 hidden unit network, and the decoder as another neural network with 20 hidden units. They compare this against a standard recurrent neural net (RNN) with 25 hidden units trained to minimize Gaussian log-likelihood. The authors tested both of these network systems on a dataset of 2-dimensional spirals which either rotated clockwise or counter-clockwise and sampled the positions of each spiral at 100 equally spaced time steps. They can then simulate irregularly timed data by taking random amounts of points without replacement from each spiral. The next two figures show the outcome of these experiments:

<div style="text-align:center;">
[[Image:LODEtestresults.png]] [[Image:SpiralFigure.png|The blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model]]
</div>

In the figure on the right the blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model. It is noted that the LODE performs significantly better than the standard RNN model, especially on smaller sets of data points.

== Scope and Limitations ==

Section 6 mainly discusses the scope and limitations of the paper. Firstly, while “batching” the training data is a useful step in standard neural nets, and can still be applied here by combining the ODEs associated with each batch, the authors found that controlling the error, in this case, may increase the number of calculations required. In practice, however, the number of calculations did not increase significantly.

So long as the model proposed in this paper uses finite weights and Lipschitz nonlinearities, Picard’s existence theorem (Coddington and Levinson, 1955) applies, which guarantees that the solution to the IVP exists and is unique. This theorem holds for the model presented above when the network has finite weights and uses nonlinearities in the Lipshitz class.

In controlling the amount of error in the model, the authors were only able to reduce tolerances to approximately <math>10^{-3}</math> and <math>10^{-5}</math> in classification and density estimation respectively without also degrading the computational performance.

The authors believe that reconstructing state trajectories by running the dynamics backwards can introduce extra numerical error. They address a possible solution to this problem by checkpointing certain time steps and storing intermediate values of z on the forward pass. Then while reconstructing, you do each part individually between checkpoints. The authors acknowledged that they informally checked the validity of this method since they don’t consider it a practical problem.

There remain, however, areas where standard neural networks may perform better than Neural ODEs. Firstly, conventional nets can fit non-homeomorphic functions, for example, functions whose output has a smaller dimension that their input, or that change the topology of the input space. However, this could be handled by composing ODE nets with standard network layers. Another point is that conventional nets can be evaluated exactly with a fixed amount of computation, and are typically faster to train, and do not require an error tolerance for a solver.

== Conclusions and Critiques ==

We covered the use of black-box ODE solvers as a model component and their application to initial value problems constructed from real applications. Neural ODE Networks show promising gains in computational cost without large sacrifices in accuracy when applied to certain problems. A drawback of some of these implementations is that the ODE Neural Networks are limited by the underlying distributions of the problems they are trying to solve (requirement of Lipschitz continuity, etc.). There are plenty of further advances to be made in this field as hundreds of years of ODE theory and literature is available, so this is currently an important area of research.

ODEs indeed represent an important area of applied mathematics where neural networks can be used to solve them numerically. Perhaps, a parallel area of investigation can be PDEs (Partial Differential Equations). PDEs are also widely encountered in many areas of applied mathematics, physics, social sciences, and many other fields. It will be interesting to see how neural networks can be used to solve PDEs.

== More Critiques ==

This paper covers the memory efficiency of Neural ODE Networks, but does not address runtime. In practice, most systems are bound by latency requirements more-so than memory requirements (except in edge device cases). Though it may be unreasonable to expect the authors to produce a performance-optimized implementation, it would be insightful to understand the computational bottlenecks so existing frameworks can take steps to address them. This model looks promising and practical performance is the key to enabling future research in this.

The above critique also questions the need for a neural network for such a problem. This problem was studied by Brunel et al. and they presented their solution in their paper ''Parametric Estimation of Ordinary Differential Equations with Orthogonality Conditions''. While this solution also requires iteratively solving a complex optimization problem, they did not require the massive memory and runtime overhead of a neural network. For the neural network solution to demonstrate its potential, it should be including experimental comparisons with specialized ordinary differential equation algorithms instead of simply comparing with a general recurrent neural network.

== References ==
Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. ''arXiv preprint arXiv'':1710.10121, 2017.

Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. ''Inverse Problems'', 34 (1):014004, 2017.

Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. ''arXiv preprint arXiv'':1804.04272, 2018.

Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. ''The mathematical theory of optimal processes''. 1962.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ''European conference on computer vision'', pages 630–645. Springer, 2016b.

Earl A Coddington and Norman Levinson. ''Theory of ordinary differential equations''. Tata McGrawHill Education, 1955.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. ''arXiv preprint arXiv:1505.05770'', 2015.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. ''arXiv preprint arXiv:1410.8516'', 2014.

Brunel, N. J., Clairon, Q., & d’Alché-Buc, F. (2014). Parametric estimation of ordinary differential equations with orthogonality conditions. ''Journal of the American Statistical Association'', 109(505), 173-185.

A. Iserles. A first course in the numerical analysis of differential equations. Cambridge Texts in Applied Mathematics, second edition, 2009

Streaming Bayesian Inference for Crowdsourced Classification

2020-11-29T13:27:25Z

D5cui: /* Critique */

Group 4 Paper Presentation Summary

By Jonathan Chow, Nyle Dharani, Ildar Nasirov

== Motivation ==
Crowdsourcing can be a useful tool for data generation in classification projects by collecting annotations of large groups. Typically, this takes the form of online questions which many respondents will manually answer for payment. One example of this is Amazon's Mechanical Turk, a website where businesses (or "requesters") will remotely hire individuals(known as "Turkers" or "crowdsourcers") to perform human intelligence tasks (tasks that computers cannot do). In theory, it is effective in processing high volumes of small tasks that would be expensive to achieve in other methods.

The primary limitation of this method to acquire data is that respondents can submit incorrect responses so that we couldn't ensure the data quality.

Therefore, the success of crowdsourcing is limited by how well ground-truth can be determined. The primary method for doing so is probabilistic inference. However, current methods are computationally expensive, lack theoretical guarantees, or are limited to specific settings. In the meantime, there are some approaches to focus on how we sample the data from the crowd, rather than how we aggregate it to improve the accuracy of the system.

== Dawid-Skene Model for Crowdsourcing ==
The one-coin Dawid-Skene model is popular for contextualizing crowdsourcing problems. For task <math>i</math> in set <math>M</math>, let the ground-truth be the binary <math>y_i = {\pm 1}</math>. We get labels <math>X = {x_{ij}}</math> where <math>j \in N</math> is the index for that worker.

We assume that we interact with workers in sequential fashion. At each time step <math>t</math>, a worker <math>j = a(t) </math> provides their label for an assigned task <math>i</math> and provides the label <math>x_{ij} = {\pm 1}</math>. We denote responses up to time <math>t</math> via superscript.

We let <math>x_{ij} = 0</math> if worker <math>j</math> has not completed task <math>i</math>. We assume that <math>P(x_{ij} = y_i) = p_j</math>. This implies that each worker is independent and has equal probability of correctly labelling regardless of task. In crowdsourcing the data, we must determine how workers are assigned to tasks. We introduce two methods.

Under uniform sampling, workers are allocated to tasks such that each task is completed by the same number of workers, rounded to the nearest integer, and no worker completes a task more than once. This policy is given by <center><math>\pi_{uni}(t) = {\rm argmin}_{i \notin M_{a(t)}^t}\{ | N_i^t | \}.</math></center>

Under uncertainty sampling, we assign more workers to tasks that are less certain. Assuming, we are able to estimate the posterior probability of ground-truth, we can allocate workers to the task with the lowest probability of falling into the predicted class. This policy is given by <center><math>\pi_{us}(t) = {\rm argmin}_{i \notin M_{a(t)}^t}\{ (max_{k \in \{\pm 1\}} ( P(y_i = k | X^t) ) \}.</math></center>

We then need to aggregate the data. The simple method of majority voting makes predictions for a given task based on the class the most workers have assigned it, <math>\hat{y}_i = \text{sign}\{\sum_{j \in N_i} x_{ij}\}</math>.

== Streaming Bayesian Inference for Crowdsourced Classification (SBIC) ==
The aim of the SBIC algorithm is to estimate the posterior probability, <math>P(y, p | X^t, \theta)</math> where <math>X^t</math> are the observed responses at time <math>t</math> and <math>\theta</math> is our prior. We can then generate predictions <math>\hat{y}^t</math> as the marginal probability over each <math>y_i</math> given <math>X^t</math>, and <math>\theta</math>.

We factor <math>P(y, p | X^t, \theta) \approx \prod_{i \in M} \mu_i^t (y_i) \prod_{j \in N} \nu_j^t (p_j) </math> where <math>\mu_i^t</math> corresponds to each task and <math>\nu_j^t</math> to each worker.

We then sequentially optimize the factors <math>\mu^t</math> and <math>\nu^t</math>. We begin by assuming that the worker accuracy follows a beta distribution with parameters <math>\alpha</math> and <math>\beta</math>. Initialize the task factors <math>\mu_i^0(+1) = q</math> and <math>\mu_i^0(-1) = 1 – q</math> for all <math>i</math>.

When a new label is observed at time <math>t</math>, we update the <math>\nu_j^t</math> of worker <math>j</math>. We then update <math>\mu_i</math>. These updates are given by

<center><math>\nu_j^t(p_j) \sim \text{Beta}(\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha, \sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(-x_{ij}) + \beta) </math></center>

<center><math>\mu_i^t(y_i) \propto \begin{cases} \mu_i^{t - 1}(y_i)\overline{p}_j^t & x_{ij} = y_i \\ \mu_i^{t - 1}(y_i)(1 - \overline{p}_j^t) & x_{ij} \ne y_i \end{cases}</math></center>
where <math>\hat{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta }</math>.

We choose our predictions to be the maximum <math>\mu_i^t(k) </math> for <math>k= \{-1,1\}</math>.

Depending on our ordering of labels <math>X</math>, we can select for different applications.

== Fast SBIC ==
The pseudocode for Fast SBIC is shown below.

<center>[[Image:FastSBIC.png|800px|]]</center>

As the name implies, the goal of this algorithm is speed. To facilitate this, we leave the order of <math>X</math> unchanged.

We express <math>\mu_i^t</math> in terms of its log-odds
<center><math>z_i^t = \log(\frac{\mu_i^t(+1)}{ \mu_i^t(-1)}) = z_i^{t - 1} + x_{ij} \log(\frac{\overline{p}_j^t}{1 - \overline{p}_j^t })</math></center>
where <math>z_i^0 = \log(\frac{q}{1 - q})</math>.

The product chain then becomes a summation and removes the need to normalize each <math>\mu_i^t</math>. We use these log-odds to compute worker accuracy,

<center><math>\overline{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} \sigma(x_{ij} z_i^{t-1}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta}</math></center>
where <math>\sigma(z_i^{t-1}) := \frac{1}{1 + exp(-z_i^{t - 1})} = \mu_i^{t - 1}(+1) </math>

The final predictions are made by choosing class <math>\hat{y}_i^T = \text{sign}(z_i^T) </math>. We see later that Fast SBIC has similar computational speed to majority voting.

== Sorted SBIC ==
To increase the accuracy of the SBIC algorithm in exchange for computational efficiency, we run the algorithm in parallel giving labels in different orders. The pseudocode for this algorithm is given below.

<center>[[Image:SortedSBIC.png|800px|]]</center>

From the general discussion of SBIC, we know that predictions on task <math>i</math> are more accurate toward the end of the collection process. This is a result of observing more data points and having run more updates on <math>\mu_i^t</math> and <math>\nu_j^t</math> to move them further from their prior. This means that task <math>i</math> is predicted more accurately when its corresponding labels are seen closer to the end of the process.

We take advantage of this property by maintaining a distinct “view” of the log-odds for each task. When a label is observed, we update views for all tasks except the one for which the label was observed. At the end of the collection process, we process skipped labels. When run online, this process must be repeated at every timestep.

We see that sorted SBIC is slower than Fast SBIC by a factor of M, the number of tasks. However, we can reduce the complexity by viewing <math>s^k</math> across different tasks in an offline setting when the whole dataset is known in advance.

== Theoretical Analysis ==
The authors prove an exponential relationship between the error probability and the number of labels per task. This allows for an upper bound to be placed on the error, in the asymptotic regime, where it can be assumed the SBIC predictions are close to the true values. The two theorems, for the different sampling regimes, are presented below.

<center>[[Image:Theorem1.png|800px|]]</center>

<center>[[Image:Theorem2.png|800px|]]</center>

== Empirical Analysis ==
The purpose of the empirical analysis is to compare SBIC to the existing state of the art algorithms. The SBIC algorithm is run on five real-world binary classification datasets. The results can be found in the table below. Other algorithms in the comparison are, from left to right, majority voting, expectation-maximization, mean-field, belief-propagation, Monte-Carlo sampling, and triangular estimation.

First of all, the algorithms are run on a synthetic data that meets the assumptions of an underlying one-coin Dawid-Skene model, which allows the authors to compare SBIC's performance empirically with the theoretical results previously shown.

<center>[[Image:RealWorldResults.png|800px|]]</center>

In bold are the best performing algorithms for each dataset. We see that both versions of the SBIC algorithm are competitive, having similar prediction errors to EM, AMF, and MC. All are considered state-of-the-art Bayesian algorithms.

The figure below shows the average time required to simulate predictions on synthetic data under an uncertainty sampling policy. We see that Fast SBIC is comparable to majority voting and significantly faster than the other algorithms. This speed improvement, coupled with comparable accuracy, makes the Fast SBIC algorithm powerful.

<center>[[Image:TimeRequirement.png|800px|]]</center>

== Conclusion and Future Research ==
In conclusion, we have seen that SBIC is computationally efficient, accurate in practice, and has theoretical guarantees. The authors intend to extend the algorithm to the multi-class case in the future.

== Critique ==
In crowdsourcing data, the cost associated with collecting additional labels is not usually prohibitively expensive. As a result, if there is concern over ground-truth, paying for additional data to ensure X is sufficiently dense may be the desired response as opposed to sacrificing ground-truth accuracy. This may result in the SBIC algorithm being less practically useful than intended. Perhaps, a study can be done to compare the cost of acquiring additional data and checking how much it improves the accuracy of ground-truth. This can help us decide the usefulness of this algorithm. Second, SBIC should be used in multiple types of data like audio, text, and images to make sure that it delivers consistent results.

The paper is tackling the classic problem of aggregating labels in a crowdsourced application, with a focus on speed. The algorithms proposed are fast and simple to implement and come with theoretical guarantees on the bounds for error rates. However, the paper starts with an objective of designing fast label aggregation algorithms for a streaming setting yet doesn’t spend any time motivating the applications in which such algorithms are needed. All the datasets used in the empirical analysis are static datasets therefore for the paper to be useful, the problem considered should be well motivated. It also appears that the output from the algorithm depends on the order in which the data is processed, which may need to be clarified. Finally, the theoretical results are presented under the assumption that the predictions of the FBI converge to the ground truth, however, the reasoning behind this assumption is not explained.

The paper assumes that crowdsourcing from human beings is systematic: that is, respondents to the problems would act in similar ways that can be classified into some categories. There are lots of other factors that need to be considered for a human respondent, such as fatigue effects and conflicts of interest. Those factors would seriously jeopardize the validity of the results and the model if they were not carefully designed and taken care of. For example, one formally accurate subject reacts badly to the subject one day generating lots of faulty data, and it would take lots of correct votes to even out the effects. Even in lots of medical experiment that involves human subjects, with rigorous standards and procedures, the results could still be invalid. The trade-off for speed while sacrificing the validity of results is not wise.

When introducing the Streaming Bayesian Inference for Crowdsourced Classification method, it was explicitly mentioned that one can select for different applications based on the ordering of labels X. However, "applications" mentioned here were not further explained or explored to support the effectiveness and meaning of developing such an algorithm. Thus, it would be sufficient for the author to build a connection between the proposed algorithm and its real-world application to make this summary more purposeful and engaging.

== References ==
[1] Manino, Tran-Thanh, and Jennings. Streaming Bayesian Inference for Crowdsourced Classification. 33rd Conference on Neural Information Processing Systems, 2019

User:Gtompkin

2020-11-27T17:00:04Z

D5cui: /* Related Work */

== Presented by ==
Grace Tompkins, Tatiana Krikella, Swaleh Hussain

== Introduction ==

One of the fundamental challenges in machine learning and data science is dealing with missing and incomplete data. This paper proposes a theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of data by imputation or other commonly used methods in the existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. The neuron's response at the first hidden layer is generalized by taking its expected value to process this probabilistic representation. This process is essentially calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous as it has the ability to train neural networks using incomplete observations from datasets, which are ubiquitous in practice. This approach also requires minimal adjustments and modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results showed the practical uses of this methodology on several different types of networks.

== Related Work ==

Currently, dealing with incomplete inputs in machine learning requires filling absent attributes based on complete, observed data. Two commonly used methods are mean imputation and <math>k</math>-nearest neighbours (k-NN) imputation. In the former (mean imputation), the missing value is replaced by the mean of all available values of that feature in the dataset. In the latter (k-NN imputation), the missing value is also replaced by the mean, however it is now computed using only the k "closest" samples in the dataset. If the dataset is numerical, Euclidean distance could be used as a measure of "closeness", for example. Other methods for dealing with missing data involve training separate neural networks and extreme learning machines. Probabilistic models of incomplete data can also be built depending on the mechanism missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), which can be fed into a particular learning model. Further, the decision function can also be trained using available/visible inputs alone. Previous work using neural networks for missing data includes a paper by Bengio and Gringras [1] where the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et. al. [2] also used neural networks by introducing a multi-prediction deep Boltzmann machine that could perform classification on data with missingness in the inputs.

== Layer for Processing Missing Data ==

In this approach, the adaptation of a given neural network to incomplete data relies on two steps: the estimation of the missing data and the generalization of the neuron's activation.

Let <math>(x,J)</math> represent a missing data point, where <math>x \in \mathbb{R}^D </math>, and <math>J \subset \{1,...,D\} </math> is a set of attributes with missing data. <math>(x,J)</math> therefore represents an "incomplete" data point for which <math>|J|</math>-many entries are unknown - examples of this could be a list of daily temperature readings over a week where temperature was not recorded on the third day (<math>x\in \mathbb{R}^7, J= \{3\}</math>), an audio transcript that goes silent for certain timespans, or images that are partially masked out (as discussed in the examples).

For each missing point <math>(x,J)</math>, define an affine subspace consisting of all points which coincide with <math>x</math> on known coordinates <math>J'=\{1,…,N\}/J</math>:

<center><math>S=Aff[x,J]=span(e_J) </math></center>
where <math>e_J=[e_j]_{j\in J}</math> and <math>e_j</math> is the <math> j^{th}</math> canonical vector in <math>\mathbb{R}^D </math>.

Assume that the missing data points come from the D-dimensional probability distribution, <math>F</math>. In their approach, the authors assume that the data points follow a mixture of Gaussians (GMM) with diagonal covariance matrices. By choosing diagonal covariance matrices, the number of model parameters is reduced. To model the missing points <math>(x,J)</math>, the density <math>F</math> is restricted to the affine subspace <math>S</math>. Thus, possible values of <math>(x,J)</math> are modelled using the conditional density <math>F_S: S \to \mathbb{R} </math>,

<center><math>F_S(x) = \begin{cases}
\frac{1}{\int_{S} F(s) \,ds}F(x) & \text{if $x \in S$,} \\
0 & \text{otherwise.}
\end{cases} </math></center>

To process the missing data by a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all the neurons in the first hidden layer of the network to process the probability density functions representing the missing data points. For the conditional density function <math>F_S</math>, the authors define the generalized activation of a neuron <math>n: \mathbb{R}^D \to \mathbb{R}</math> on <math>F_S </math> as:

<center><math>n(F_S)=E[n(x)|x \sim F_S]=\int n(x)F_S(x) \,dx</math>,</center>
provided that the expectation exists.

The following two theorems describe how to apply the above generalizations to both the ReLU and the RBF neurons, respectively.

'''Theorem 3.1''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians. Given weights <math>w=(w_1, ..., w_D) \in \mathbb{R}^D,</math><math> b \in \mathbb{R} </math>, we have

<center><math>\text{ReLU}_{w,b}(F)=\sum_i{p_iNR\big(\frac{w^{\top}m_i+b}{\sqrt{w^{\top}\Sigma_iw}}}\big)</math></center>

where <math>NR(x)=\text{ReLU}[N(x,1)]</math> and <math>\text{ReLU}_{w,b}(x)=\text{max}(w^{\top}+b, 0)</math>, <math>w \in \mathbb{R}^D </math> and <math> b \in \mathbb{R}</math> is the bias.

'''Theorem 3.2''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians and let the RBF unit be parametrized by <math>N(c, \Gamma) </math>. We have:

<center><math>\text{RBF}_{c, \Gamma}(F) = \sum_{i=1}^k{p_iN(m_i-c, \Gamma+\Sigma_i)}(0)</math>.</center>

In the case where the data set contains no missing values, the generalized neurons reduce to classical ones, since the distribution <math>F</math> is only used to estimate possible values at missing attributes. However, if one wishes to use an incomplete data set in the testing stage, then an incomplete data set must be used to train the model.

<math> </math>

== Theoretical Analysis ==

The main theoretical results, which are summarized below, show that using generalized neuron's activation at the first layer does not lead to the loss of information.

Let the generalized response of a neuron <math>n: \mathbb{R}^D \rightarrow \mathbb{R}</math> evaluated on a probability measure <math>\mu</math> which is given by
<center><math>n(\mu) := \int n(x)d\mu(x)</math></center>

'''Theorem 4.1.''' Let <math>\mu</math>, <math>v</math> be probabilistic measures satisfying <math>\int ||x|| d \mu(x) < \infty</math>. If
<center><math>ReLU_{w,b}(\mu) = ReLU_{w,b}(\nu) \text{ for } w \in \mathbb{R}^D, b \in \mathbb{R}</math></center> then <math>\nu = \mu</math>.

Theorem 4.1 shows that a neural network with generalized ReLU units is able to identify any two probability measures. The proof presented by the authors uses the Universal Approximation Property (UAP), and is summarized as follows.

''Sketch of Proof'' Let <math>w \in \mathbb{R}^D</math> be fixed and define the set <center><math>F_w = \{p: \mathbb{R} \rightarrow \mathbb{R}: \int p(w^Tx)d\mu(x) = \int p(w^Tx)d\nu(x)\}.</math></center> The first step of the proof involves showing that <math>F_w</math> contains all continuous and bounded functions. The authors show this by showing that a piecewise continuous function that is affine linear on specific intervals, <math>Q</math>, is in the set <math>F_w</math>. This involves re-writing <math>Q</math> as a sum of tent-like piecewise linear functions, <math>T</math> and showing that <math>T \in F_w</math> (since it is sufficient to only show <math>T \in F_w</math>).

Next, the authors show that an arbitrary bounded continuous function <math>G</math> is in <math>F_w</math> by the Lebesgue dominated convergence theorem.

Then, as <math>cos(\cdot), sin(\cdot) \in F_w</math>, the function <center><math>exp(ir) = cos(r) + i sin(r) \in F_w</math></center> and we have the equality <center><math>\int exp(iw^Tx)d\mu(x) = \int exp(iw^Tx)d\nu(x).</math></center> Since <math>w</math> was arbitrarily chosen, we can conclude that <math>\mu = \nu</math> as the characteristic functions of the two measures coincide.

A result analogous to Theorem 4.1 for RBF can also be obtained.

'''Theorem 2.1''' Let <math>\mu, \nu</math> be probabilistic measures. If
$$ RBF_{m,\alpha}(\mu) = RBF_{m,\alpha}(\nu) \text{ for every } m \in \mathbb{R}^D, \alpha > 0,$$
then <math>\nu = \mu</math>.

More general results can be obtained making stronger assumptions on the probability measures. For example, if a given family of neurons satisfies UAP, then their generalization can identify any probability measure with compact support.

'''Theorem 2.2''' Let <math>\mu, \nu</math> be probabilistic measures with compact support. Let <math>\mathcal{N}</math> be a family of functions having UAP. If
$$n(\mu) = n(\nu) \text{ for every } n \in \mathcal{N},$$
then <math>\nu = \mu</math>.

A detailed proof Theorems 2.1 and 2.2 can be found in section 2 of the Supplementary Materials, which can be downloaded [https://proceedings.neurips.cc/paper/2018/hash/411ae1bf081d1674ca6091f8c59a266f-Abstract.html here].

== Experimental Results ==
The model was applied to three types of algorithms: an Autoencoder (AE), a multilayer perceptron, and a radial basis function network.

'''Autoencoder'''

Corrupted images were restored as a part of this experiment. Grayscale handwritten digits were obtained from the MNIST database. A 13 by 13 (169 pixels) square was removed from each 28 by 28 (784 pixels) image. The location of the square was uniformly sampled for each image. The autoencoder used included 5 hidden layers. The first layer used ReLU activation functions while the subsequent layers utilized sigmoids. The loss function was computed using pixels from outside the mask.

Popular imputation techniques were compared against the conducted experiment:

''k-nn:'' Replaced missing features with the mean of respective features calculated using K nearest training samples. Here, K=5.

''mean:'' Replaced missing features with the mean of respective features calculated using all incomplete training samples.

''dropout:'' Dropped input neutrons with missing values.

Moreover, a context encoder (CE) was trained by replacing missing features with their means. Unlike mean imputation, the complete data was used in the training phase. The method under study performed better than the imputation methods inside and outside the mask. Additionally, the method under study outperformed CE based on the whole area and area outside the mask.

The mean square error of reconstruction is used to test each method. The errors calculated over the whole area, inside and outside the mask are shown in Table 1, which indicates the method introduced in this paper is the most competitive.

[[File:Group3_Table1.png |center]]
<div align="center">Table 1: Mean square error of reconstruction on MNIST incomplete images scaled to [0, 1]</div>

'''Multilayer Perceptron'''

A multilayer perceptron with 3 ReLU hidden layers was applied to a multi-class classification problem on the Epileptic Seizure Recognition (ESR) data set taken from [3]. Each 178-dimensional vector (out of 11500 samples) is the EEG recording of a given person for 1 second, categorized into one of 5 classes. To generate missing attributes, 25%, 50%, 75%, and 90% of observations were randomly removed. The aforementioned imputation methods were used in addition to Multiple Imputation by Chained Equation (mice) and a mixture of Gaussians (gmm). The former utilizes the conditional distribution of data by Markov chain Monte Carlo techniques to draw imputations. The latter replaces missing features with values sampled from GMM estimated from incomplete data using the EM algorithm.

Double 5-fold cross-validation was used to report classification results. The classical accuracy measure is usually being used for accessing the results.The model under study outperformed classical imputation methods, which give reasonable results only for a low number of missing values. The method under study performs nearly as well as CE, even though CE had access to complete training data.

'''Radial Basis Function Network'''

RBFN can be considered as a minimal architecture implementing our model, which contains only one hidden layer. A cross-entropy function was applied to a softmax in the output layer. Two-class data sets retrieved from the UCI repository [4] with internally missing attributes were used. Since the classification is binary, two additional SVM kernel models which work directly with incomplete data without performing any imputations were included in the experiment:

''geom:'' The objective function is based on the geometric interpretation of the margin and aims to maximize the margin of each sample in its own subspace [5].

''karma:'' This algorithm iteratively tunes kernel classifier under low-rank assumptions [6].

The above SVM methods were combined with RBF kernel function. The number of RBF units was selected in the inner cross-validation from the range {25, 50, 75, 100}. Initial centers of RBFNs were randomly selected from training data while variances were samples from <math>N(0,1)</math> distribution. For SVM methods, the margin parameter <math>C</math> and kernel radius <math>\gamma</math> were selected from <math>\{2^k :k=−5,−3,...,9\}</math> for both parameters. For karma, additional parameter <math>\gamma_{karma}</math> was selected from the set <math>\{1, 2\}</math>.

The model under study outperformed imputation techniques in almost all cases. It partially confirms that the use of raw incomplete data in neural networks is a usually better approach than filling missing attributes before the learning process. Moreover, it obtained more accurate results than modified kernel methods, which directly work on incomplete data. The performance of the model was once again comparable to, and in some cases better than CE, which had access to the complete data.

== Conclusion ==

The results with these experiments along with the theoretical results conclude that this novel approach for dealing with missing data through a modification of a neural network is beneficial and outperforms many existing methods. This approach, which utilizes representing missing data with a probability density function, allows a neural network to determine a more generalized and accurate response of the neuron.

== Critiques ==

- A simulation study where the mechanism of missingness is known will be interesting to examine. Doing this will allow us to see when the proposed method is better than existing methods, and under what conditions.

- This method of imputing incomplete data has many limitations: in most cases when we have a missing data point we are facing a relatively small amount of data that does not require training of a neural network. For a large dataset, missing records does not seem to be very crucial because obtaining data will be relatively easier, and using an empirical way of imputing data such as a majority vote will be sufficient.

- An interesting application of this problem is in NLP. In NLP, especially Question Answering, there is a problem where a query is given and an answer must be retrieved, but the knowledge base is incomplete. There is therefore a requirement for the model to be able to infer information from the existing knowledge base in order to answer the question. Although this problem is a little more contrived than the one mentioned here, it is nevertheless similar in nature because it requires the ability to probabilistically determine some value which can then be used as a response.

- The experiments in this paper evaluate this method against low amounts of missing data. It would be interesting to see the properties of this imputation when a majority of the data is missing, and see if this method can outperform dropout training in this setting (dropout is known to be surprisingly robust even at high drop levels).

- This problem can possibly be applied to face recognition where given a blurry image of a person's face, the neural network can make the image clearer such that the face of the person would be visible for humans to see and also possible for the software to identify who the person is.

- This novel approach can also be applied to restoring damaged handwritten historical documents. By feeding in a damaged document with portions of unreadable texts, the neural network can add missing information utilizing a trained context encoder

- It will be interesting to see how this method performs with audio data, i.e. say if there are parts of an audio file that are missing, whether the neural network will be able to learn the underlying distribution and impute the missing sections of speech.

- In general, data are usually missing in a specific part of the content. For example, old books usually have first couple page or last couple pages that are missing. It would be interesting to see that how the distribution of "missing data" will be applied in those cases.

- In this paper, the researchers were able to outperform existing imputation methods using neural networks. It would be really nice to see how does the usage of neural networks impact the need for amount of data, and how much more training is required in comparison to the other algorithms provided in this paper.

- It might be an interesting approach to investigate how the size of missing data may influence the training. For example, in the MNIST AutoEncoder, some algorithms involve masks to generate more general images and avoid overfitting. The approach could compare the result by changing the size of the "missing" part to illustrate to what degree can we ignore the missing data and view them as assistance.

- It would be nice to see how the method under study outperforms both the Multilayer Perceptron and Radial Basis Function Network methods mentioned in the summary. Since these two methods are placed under the Experimental Results section, it is expected that they consist more justifications and supporting evidence (analytical results, tables, graphs, etc.) then what have been presented, so that it is more convincing for the readers.

== References ==
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous
data. In Advances in neural information processing systems, pages 395–401, 1996.

[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

[3] Ralph G Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian E Elger. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E, 64(6):061907, 2001.

[4] Arthur Asuncion and David J. Newman. UCI Machine Learning Repository, 2007.

[5] Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, and Daphne Koller. Max-margin classification of data with absent features. Journal of Machine Learning Research, 9:1–21, 2008.

[6] Elad Hazan, Roi Livni, and Yishay Mansour. Classification with low rank and missing data. In Proceedings of The 32nd International Conference on Machine Learning, pages 257–266, 2015.

User:Gtompkin

2020-11-27T16:59:24Z

D5cui: /* Introduction */

== Presented by ==
Grace Tompkins, Tatiana Krikella, Swaleh Hussain

== Introduction ==

One of the fundamental challenges in machine learning and data science is dealing with missing and incomplete data. This paper proposes a theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of data by imputation or other commonly used methods in the existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. The neuron's response at the first hidden layer is generalized by taking its expected value to process this probabilistic representation. This process is essentially calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous as it has the ability to train neural networks using incomplete observations from datasets, which are ubiquitous in practice. This approach also requires minimal adjustments and modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results showed the practical uses of this methodology on several different types of networks.

== Related Work ==

Currently, dealing with incomplete inputs in machine learning requires filling absent attributes based on complete, observed data. Two commonly used methods are mean imputation and <math>k</math>-nearest neighbours (k-NN) imputation. In the former (mean imputation), the missing value is replaced by the mean of all available values of that feature in the dataset. In the latter (k-NN imputation), the missing value is also replaced by the mean, however it is now computed using only the k "closest" samples in the dataset. If the dataset is numerical, Euclidean distance could be used as a measure of "closeness", for example. Other methods for dealing with missing data involve training separate neural networks and extreme learning machines. Probabilistic models of incomplete data can also be built depending on the mechanism missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), which can be fed into a particular learning model. Further, the decision function can also be trained using available/visible inputs alone. Previous work using neural networks for missing data includes a paper by Bengio and Gringras [1] where the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et. al. [2] also used neural networks by introducing a multi-prediction deep Boltzmann machine which could perform classification on data with missingness in the inputs.

== Layer for Processing Missing Data ==

In this approach, the adaptation of a given neural network to incomplete data relies on two steps: the estimation of the missing data and the generalization of the neuron's activation.

Let <math>(x,J)</math> represent a missing data point, where <math>x \in \mathbb{R}^D </math>, and <math>J \subset \{1,...,D\} </math> is a set of attributes with missing data. <math>(x,J)</math> therefore represents an "incomplete" data point for which <math>|J|</math>-many entries are unknown - examples of this could be a list of daily temperature readings over a week where temperature was not recorded on the third day (<math>x\in \mathbb{R}^7, J= \{3\}</math>), an audio transcript that goes silent for certain timespans, or images that are partially masked out (as discussed in the examples).

For each missing point <math>(x,J)</math>, define an affine subspace consisting of all points which coincide with <math>x</math> on known coordinates <math>J'=\{1,…,N\}/J</math>:

<center><math>S=Aff[x,J]=span(e_J) </math></center>
where <math>e_J=[e_j]_{j\in J}</math> and <math>e_j</math> is the <math> j^{th}</math> canonical vector in <math>\mathbb{R}^D </math>.

Assume that the missing data points come from the D-dimensional probability distribution, <math>F</math>. In their approach, the authors assume that the data points follow a mixture of Gaussians (GMM) with diagonal covariance matrices. By choosing diagonal covariance matrices, the number of model parameters is reduced. To model the missing points <math>(x,J)</math>, the density <math>F</math> is restricted to the affine subspace <math>S</math>. Thus, possible values of <math>(x,J)</math> are modelled using the conditional density <math>F_S: S \to \mathbb{R} </math>,

<center><math>F_S(x) = \begin{cases}
\frac{1}{\int_{S} F(s) \,ds}F(x) & \text{if $x \in S$,} \\
0 & \text{otherwise.}
\end{cases} </math></center>

To process the missing data by a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all the neurons in the first hidden layer of the network to process the probability density functions representing the missing data points. For the conditional density function <math>F_S</math>, the authors define the generalized activation of a neuron <math>n: \mathbb{R}^D \to \mathbb{R}</math> on <math>F_S </math> as:

<center><math>n(F_S)=E[n(x)|x \sim F_S]=\int n(x)F_S(x) \,dx</math>,</center>
provided that the expectation exists.

The following two theorems describe how to apply the above generalizations to both the ReLU and the RBF neurons, respectively.

'''Theorem 3.1''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians. Given weights <math>w=(w_1, ..., w_D) \in \mathbb{R}^D,</math><math> b \in \mathbb{R} </math>, we have

<center><math>\text{ReLU}_{w,b}(F)=\sum_i{p_iNR\big(\frac{w^{\top}m_i+b}{\sqrt{w^{\top}\Sigma_iw}}}\big)</math></center>

where <math>NR(x)=\text{ReLU}[N(x,1)]</math> and <math>\text{ReLU}_{w,b}(x)=\text{max}(w^{\top}+b, 0)</math>, <math>w \in \mathbb{R}^D </math> and <math> b \in \mathbb{R}</math> is the bias.

'''Theorem 3.2''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians and let the RBF unit be parametrized by <math>N(c, \Gamma) </math>. We have:

<center><math>\text{RBF}_{c, \Gamma}(F) = \sum_{i=1}^k{p_iN(m_i-c, \Gamma+\Sigma_i)}(0)</math>.</center>

In the case where the data set contains no missing values, the generalized neurons reduce to classical ones, since the distribution <math>F</math> is only used to estimate possible values at missing attributes. However, if one wishes to use an incomplete data set in the testing stage, then an incomplete data set must be used to train the model.

<math> </math>

== Theoretical Analysis ==

The main theoretical results, which are summarized below, show that using generalized neuron's activation at the first layer does not lead to the loss of information.

Let the generalized response of a neuron <math>n: \mathbb{R}^D \rightarrow \mathbb{R}</math> evaluated on a probability measure <math>\mu</math> which is given by
<center><math>n(\mu) := \int n(x)d\mu(x)</math></center>

'''Theorem 4.1.''' Let <math>\mu</math>, <math>v</math> be probabilistic measures satisfying <math>\int ||x|| d \mu(x) < \infty</math>. If
<center><math>ReLU_{w,b}(\mu) = ReLU_{w,b}(\nu) \text{ for } w \in \mathbb{R}^D, b \in \mathbb{R}</math></center> then <math>\nu = \mu</math>.

Theorem 4.1 shows that a neural network with generalized ReLU units is able to identify any two probability measures. The proof presented by the authors uses the Universal Approximation Property (UAP), and is summarized as follows.

''Sketch of Proof'' Let <math>w \in \mathbb{R}^D</math> be fixed and define the set <center><math>F_w = \{p: \mathbb{R} \rightarrow \mathbb{R}: \int p(w^Tx)d\mu(x) = \int p(w^Tx)d\nu(x)\}.</math></center> The first step of the proof involves showing that <math>F_w</math> contains all continuous and bounded functions. The authors show this by showing that a piecewise continuous function that is affine linear on specific intervals, <math>Q</math>, is in the set <math>F_w</math>. This involves re-writing <math>Q</math> as a sum of tent-like piecewise linear functions, <math>T</math> and showing that <math>T \in F_w</math> (since it is sufficient to only show <math>T \in F_w</math>).

Next, the authors show that an arbitrary bounded continuous function <math>G</math> is in <math>F_w</math> by the Lebesgue dominated convergence theorem.

Then, as <math>cos(\cdot), sin(\cdot) \in F_w</math>, the function <center><math>exp(ir) = cos(r) + i sin(r) \in F_w</math></center> and we have the equality <center><math>\int exp(iw^Tx)d\mu(x) = \int exp(iw^Tx)d\nu(x).</math></center> Since <math>w</math> was arbitrarily chosen, we can conclude that <math>\mu = \nu</math> as the characteristic functions of the two measures coincide.

A result analogous to Theorem 4.1 for RBF can also be obtained.

'''Theorem 2.1''' Let <math>\mu, \nu</math> be probabilistic measures. If
$$ RBF_{m,\alpha}(\mu) = RBF_{m,\alpha}(\nu) \text{ for every } m \in \mathbb{R}^D, \alpha > 0,$$
then <math>\nu = \mu</math>.

More general results can be obtained making stronger assumptions on the probability measures. For example, if a given family of neurons satisfies UAP, then their generalization can identify any probability measure with compact support.

'''Theorem 2.2''' Let <math>\mu, \nu</math> be probabilistic measures with compact support. Let <math>\mathcal{N}</math> be a family of functions having UAP. If
$$n(\mu) = n(\nu) \text{ for every } n \in \mathcal{N},$$
then <math>\nu = \mu</math>.

A detailed proof Theorems 2.1 and 2.2 can be found in section 2 of the Supplementary Materials, which can be downloaded [https://proceedings.neurips.cc/paper/2018/hash/411ae1bf081d1674ca6091f8c59a266f-Abstract.html here].

== Experimental Results ==
The model was applied to three types of algorithms: an Autoencoder (AE), a multilayer perceptron, and a radial basis function network.

'''Autoencoder'''

Corrupted images were restored as a part of this experiment. Grayscale handwritten digits were obtained from the MNIST database. A 13 by 13 (169 pixels) square was removed from each 28 by 28 (784 pixels) image. The location of the square was uniformly sampled for each image. The autoencoder used included 5 hidden layers. The first layer used ReLU activation functions while the subsequent layers utilized sigmoids. The loss function was computed using pixels from outside the mask.

Popular imputation techniques were compared against the conducted experiment:

''k-nn:'' Replaced missing features with the mean of respective features calculated using K nearest training samples. Here, K=5.

''mean:'' Replaced missing features with the mean of respective features calculated using all incomplete training samples.

''dropout:'' Dropped input neutrons with missing values.

Moreover, a context encoder (CE) was trained by replacing missing features with their means. Unlike mean imputation, the complete data was used in the training phase. The method under study performed better than the imputation methods inside and outside the mask. Additionally, the method under study outperformed CE based on the whole area and area outside the mask.

The mean square error of reconstruction is used to test each method. The errors calculated over the whole area, inside and outside the mask are shown in Table 1, which indicates the method introduced in this paper is the most competitive.

[[File:Group3_Table1.png |center]]
<div align="center">Table 1: Mean square error of reconstruction on MNIST incomplete images scaled to [0, 1]</div>

'''Multilayer Perceptron'''

A multilayer perceptron with 3 ReLU hidden layers was applied to a multi-class classification problem on the Epileptic Seizure Recognition (ESR) data set taken from [3]. Each 178-dimensional vector (out of 11500 samples) is the EEG recording of a given person for 1 second, categorized into one of 5 classes. To generate missing attributes, 25%, 50%, 75%, and 90% of observations were randomly removed. The aforementioned imputation methods were used in addition to Multiple Imputation by Chained Equation (mice) and a mixture of Gaussians (gmm). The former utilizes the conditional distribution of data by Markov chain Monte Carlo techniques to draw imputations. The latter replaces missing features with values sampled from GMM estimated from incomplete data using the EM algorithm.

Double 5-fold cross-validation was used to report classification results. The classical accuracy measure is usually being used for accessing the results.The model under study outperformed classical imputation methods, which give reasonable results only for a low number of missing values. The method under study performs nearly as well as CE, even though CE had access to complete training data.

'''Radial Basis Function Network'''

RBFN can be considered as a minimal architecture implementing our model, which contains only one hidden layer. A cross-entropy function was applied to a softmax in the output layer. Two-class data sets retrieved from the UCI repository [4] with internally missing attributes were used. Since the classification is binary, two additional SVM kernel models which work directly with incomplete data without performing any imputations were included in the experiment:

''geom:'' The objective function is based on the geometric interpretation of the margin and aims to maximize the margin of each sample in its own subspace [5].

''karma:'' This algorithm iteratively tunes kernel classifier under low-rank assumptions [6].

The above SVM methods were combined with RBF kernel function. The number of RBF units was selected in the inner cross-validation from the range {25, 50, 75, 100}. Initial centers of RBFNs were randomly selected from training data while variances were samples from <math>N(0,1)</math> distribution. For SVM methods, the margin parameter <math>C</math> and kernel radius <math>\gamma</math> were selected from <math>\{2^k :k=−5,−3,...,9\}</math> for both parameters. For karma, additional parameter <math>\gamma_{karma}</math> was selected from the set <math>\{1, 2\}</math>.

The model under study outperformed imputation techniques in almost all cases. It partially confirms that the use of raw incomplete data in neural networks is a usually better approach than filling missing attributes before the learning process. Moreover, it obtained more accurate results than modified kernel methods, which directly work on incomplete data. The performance of the model was once again comparable to, and in some cases better than CE, which had access to the complete data.

== Conclusion ==

The results with these experiments along with the theoretical results conclude that this novel approach for dealing with missing data through a modification of a neural network is beneficial and outperforms many existing methods. This approach, which utilizes representing missing data with a probability density function, allows a neural network to determine a more generalized and accurate response of the neuron.

== Critiques ==

- A simulation study where the mechanism of missingness is known will be interesting to examine. Doing this will allow us to see when the proposed method is better than existing methods, and under what conditions.

- This method of imputing incomplete data has many limitations: in most cases when we have a missing data point we are facing a relatively small amount of data that does not require training of a neural network. For a large dataset, missing records does not seem to be very crucial because obtaining data will be relatively easier, and using an empirical way of imputing data such as a majority vote will be sufficient.

- An interesting application of this problem is in NLP. In NLP, especially Question Answering, there is a problem where a query is given and an answer must be retrieved, but the knowledge base is incomplete. There is therefore a requirement for the model to be able to infer information from the existing knowledge base in order to answer the question. Although this problem is a little more contrived than the one mentioned here, it is nevertheless similar in nature because it requires the ability to probabilistically determine some value which can then be used as a response.

- The experiments in this paper evaluate this method against low amounts of missing data. It would be interesting to see the properties of this imputation when a majority of the data is missing, and see if this method can outperform dropout training in this setting (dropout is known to be surprisingly robust even at high drop levels).

- This problem can possibly be applied to face recognition where given a blurry image of a person's face, the neural network can make the image clearer such that the face of the person would be visible for humans to see and also possible for the software to identify who the person is.

- This novel approach can also be applied to restoring damaged handwritten historical documents. By feeding in a damaged document with portions of unreadable texts, the neural network can add missing information utilizing a trained context encoder

- It will be interesting to see how this method performs with audio data, i.e. say if there are parts of an audio file that are missing, whether the neural network will be able to learn the underlying distribution and impute the missing sections of speech.

- In general, data are usually missing in a specific part of the content. For example, old books usually have first couple page or last couple pages that are missing. It would be interesting to see that how the distribution of "missing data" will be applied in those cases.

- In this paper, the researchers were able to outperform existing imputation methods using neural networks. It would be really nice to see how does the usage of neural networks impact the need for amount of data, and how much more training is required in comparison to the other algorithms provided in this paper.

- It might be an interesting approach to investigate how the size of missing data may influence the training. For example, in the MNIST AutoEncoder, some algorithms involve masks to generate more general images and avoid overfitting. The approach could compare the result by changing the size of the "missing" part to illustrate to what degree can we ignore the missing data and view them as assistance.

- It would be nice to see how the method under study outperforms both the Multilayer Perceptron and Radial Basis Function Network methods mentioned in the summary. Since these two methods are placed under the Experimental Results section, it is expected that they consist more justifications and supporting evidence (analytical results, tables, graphs, etc.) then what have been presented, so that it is more convincing for the readers.

== References ==
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous
data. In Advances in neural information processing systems, pages 395–401, 1996.

[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

[3] Ralph G Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian E Elger. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E, 64(6):061907, 2001.

[4] Arthur Asuncion and David J. Newman. UCI Machine Learning Repository, 2007.

[5] Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, and Daphne Koller. Max-margin classification of data with absent features. Journal of Machine Learning Research, 9:1–21, 2008.

[6] Elad Hazan, Roi Livni, and Yishay Mansour. Classification with low rank and missing data. In Proceedings of The 32nd International Conference on Machine Learning, pages 257–266, 2015.

User:Gtompkin

2020-11-27T16:56:12Z

D5cui: /* Experimental Results */

== Presented by ==
Grace Tompkins, Tatiana Krikella, Swaleh Hussain

== Introduction ==

One of the fundamental challenges in machine learning and data science is dealing with missing and incomplete data. This paper proposes a theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of data by imputation or other commonly used methods in existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. The neuron's response at the first hidden layer is generalized by taking its expected value to process this probabilistic representation. This process is essentially calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous as it has the ability to train neural networks using incomplete observations from datasets, which are ubiquitous in practice. This approach also requires minimal adjustments and modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results showed the practical uses of this methodology on several different types of networks.

== Related Work ==

Currently, dealing with incomplete inputs in machine learning requires filling absent attributes based on complete, observed data. Two commonly used methods are mean imputation and <math>k</math>-nearest neighbours (k-NN) imputation. In the former (mean imputation), the missing value is replaced by the mean of all available values of that feature in the dataset. In the latter (k-NN imputation), the missing value is also replaced by the mean, however it is now computed using only the k "closest" samples in the dataset. If the dataset is numerical, Euclidean distance could be used as a measure of "closeness", for example. Other methods for dealing with missing data involve training separate neural networks and extreme learning machines. Probabilistic models of incomplete data can also be built depending on the mechanism missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), which can be fed into a particular learning model. Further, the decision function can also be trained using available/visible inputs alone. Previous work using neural networks for missing data includes a paper by Bengio and Gringras [1] where the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et. al. [2] also used neural networks by introducing a multi-prediction deep Boltzmann machine which could perform classification on data with missingness in the inputs.

== Layer for Processing Missing Data ==

In this approach, the adaptation of a given neural network to incomplete data relies on two steps: the estimation of the missing data and the generalization of the neuron's activation.

Let <math>(x,J)</math> represent a missing data point, where <math>x \in \mathbb{R}^D </math>, and <math>J \subset \{1,...,D\} </math> is a set of attributes with missing data. <math>(x,J)</math> therefore represents an "incomplete" data point for which <math>|J|</math>-many entries are unknown - examples of this could be a list of daily temperature readings over a week where temperature was not recorded on the third day (<math>x\in \mathbb{R}^7, J= \{3\}</math>), an audio transcript that goes silent for certain timespans, or images that are partially masked out (as discussed in the examples).

For each missing point <math>(x,J)</math>, define an affine subspace consisting of all points which coincide with <math>x</math> on known coordinates <math>J'=\{1,…,N\}/J</math>:

<center><math>S=Aff[x,J]=span(e_J) </math></center>
where <math>e_J=[e_j]_{j\in J}</math> and <math>e_j</math> is the <math> j^{th}</math> canonical vector in <math>\mathbb{R}^D </math>.

Assume that the missing data points come from the D-dimensional probability distribution, <math>F</math>. In their approach, the authors assume that the data points follow a mixture of Gaussians (GMM) with diagonal covariance matrices. By choosing diagonal covariance matrices, the number of model parameters is reduced. To model the missing points <math>(x,J)</math>, the density <math>F</math> is restricted to the affine subspace <math>S</math>. Thus, possible values of <math>(x,J)</math> are modelled using the conditional density <math>F_S: S \to \mathbb{R} </math>,

<center><math>F_S(x) = \begin{cases}
\frac{1}{\int_{S} F(s) \,ds}F(x) & \text{if $x \in S$,} \\
0 & \text{otherwise.}
\end{cases} </math></center>

To process the missing data by a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all the neurons in the first hidden layer of the network to process the probability density functions representing the missing data points. For the conditional density function <math>F_S</math>, the authors define the generalized activation of a neuron <math>n: \mathbb{R}^D \to \mathbb{R}</math> on <math>F_S </math> as:

<center><math>n(F_S)=E[n(x)|x \sim F_S]=\int n(x)F_S(x) \,dx</math>,</center>
provided that the expectation exists.

The following two theorems describe how to apply the above generalizations to both the ReLU and the RBF neurons, respectively.

'''Theorem 3.1''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians. Given weights <math>w=(w_1, ..., w_D) \in \mathbb{R}^D,</math><math> b \in \mathbb{R} </math>, we have

<center><math>\text{ReLU}_{w,b}(F)=\sum_i{p_iNR\big(\frac{w^{\top}m_i+b}{\sqrt{w^{\top}\Sigma_iw}}}\big)</math></center>

where <math>NR(x)=\text{ReLU}[N(x,1)]</math> and <math>\text{ReLU}_{w,b}(x)=\text{max}(w^{\top}+b, 0)</math>, <math>w \in \mathbb{R}^D </math> and <math> b \in \mathbb{R}</math> is the bias.

'''Theorem 3.2''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians and let the RBF unit be parametrized by <math>N(c, \Gamma) </math>. We have:

<center><math>\text{RBF}_{c, \Gamma}(F) = \sum_{i=1}^k{p_iN(m_i-c, \Gamma+\Sigma_i)}(0)</math>.</center>

In the case where the data set contains no missing values, the generalized neurons reduce to classical ones, since the distribution <math>F</math> is only used to estimate possible values at missing attributes. However, if one wishes to use an incomplete data set in the testing stage, then an incomplete data set must be used to train the model.

<math> </math>

== Theoretical Analysis ==

The main theoretical results, which are summarized below, show that using generalized neuron's activation at the first layer does not lead to the loss of information.

Let the generalized response of a neuron <math>n: \mathbb{R}^D \rightarrow \mathbb{R}</math> evaluated on a probability measure <math>\mu</math> which is given by
<center><math>n(\mu) := \int n(x)d\mu(x)</math></center>

'''Theorem 4.1.''' Let <math>\mu</math>, <math>v</math> be probabilistic measures satisfying <math>\int ||x|| d \mu(x) < \infty</math>. If
<center><math>ReLU_{w,b}(\mu) = ReLU_{w,b}(\nu) \text{ for } w \in \mathbb{R}^D, b \in \mathbb{R}</math></center> then <math>\nu = \mu</math>.

Theorem 4.1 shows that a neural network with generalized ReLU units is able to identify any two probability measures. The proof presented by the authors uses the Universal Approximation Property (UAP), and is summarized as follows.

''Sketch of Proof'' Let <math>w \in \mathbb{R}^D</math> be fixed and define the set <center><math>F_w = \{p: \mathbb{R} \rightarrow \mathbb{R}: \int p(w^Tx)d\mu(x) = \int p(w^Tx)d\nu(x)\}.</math></center> The first step of the proof involves showing that <math>F_w</math> contains all continuous and bounded functions. The authors show this by showing that a piecewise continuous function that is affine linear on specific intervals, <math>Q</math>, is in the set <math>F_w</math>. This involves re-writing <math>Q</math> as a sum of tent-like piecewise linear functions, <math>T</math> and showing that <math>T \in F_w</math> (since it is sufficient to only show <math>T \in F_w</math>).

Next, the authors show that an arbitrary bounded continuous function <math>G</math> is in <math>F_w</math> by the Lebesgue dominated convergence theorem.

Then, as <math>cos(\cdot), sin(\cdot) \in F_w</math>, the function <center><math>exp(ir) = cos(r) + i sin(r) \in F_w</math></center> and we have the equality <center><math>\int exp(iw^Tx)d\mu(x) = \int exp(iw^Tx)d\nu(x).</math></center> Since <math>w</math> was arbitrarily chosen, we can conclude that <math>\mu = \nu</math> as the characteristic functions of the two measures coincide.

A result analogous to Theorem 4.1 for RBF can also be obtained.

'''Theorem 2.1''' Let <math>\mu, \nu</math> be probabilistic measures. If
$$ RBF_{m,\alpha}(\mu) = RBF_{m,\alpha}(\nu) \text{ for every } m \in \mathbb{R}^D, \alpha > 0,$$
then <math>\nu = \mu</math>.

More general results can be obtained making stronger assumptions on the probability measures. For example, if a given family of neurons satisfies UAP, then their generalization can identify any probability measure with compact support.

'''Theorem 2.2''' Let <math>\mu, \nu</math> be probabilistic measures with compact support. Let <math>\mathcal{N}</math> be a family of functions having UAP. If
$$n(\mu) = n(\nu) \text{ for every } n \in \mathcal{N},$$
then <math>\nu = \mu</math>.

A detailed proof Theorems 2.1 and 2.2 can be found in section 2 of the Supplementary Materials, which can be downloaded [https://proceedings.neurips.cc/paper/2018/hash/411ae1bf081d1674ca6091f8c59a266f-Abstract.html here].

== Experimental Results ==
The model was applied to three types of algorithms: an Autoencoder (AE), a multilayer perceptron, and a radial basis function network.

'''Autoencoder'''

Corrupted images were restored as a part of this experiment. Grayscale handwritten digits were obtained from the MNIST database. A 13 by 13 (169 pixels) square was removed from each 28 by 28 (784 pixels) image. The location of the square was uniformly sampled for each image. The autoencoder used included 5 hidden layers. The first layer used ReLU activation functions while the subsequent layers utilized sigmoids. The loss function was computed using pixels from outside the mask.

Popular imputation techniques were compared against the conducted experiment:

''k-nn:'' Replaced missing features with the mean of respective features calculated using K nearest training samples. Here, K=5.

''mean:'' Replaced missing features with the mean of respective features calculated using all incomplete training samples.

''dropout:'' Dropped input neutrons with missing values.

Moreover, a context encoder (CE) was trained by replacing missing features with their means. Unlike mean imputation, the complete data was used in the training phase. The method under study performed better than the imputation methods inside and outside the mask. Additionally, the method under study outperformed CE based on the whole area and area outside the mask.

The mean square error of reconstruction is used to test each method. The errors calculated over the whole area, inside and outside the mask are shown in Table 1, which indicates the method introduced in this paper is the most competitive.

[[File:Group3_Table1.png |center]]
<div align="center">Table 1: Mean square error of reconstruction on MNIST incomplete images scaled to [0, 1]</div>

'''Multilayer Perceptron'''

A multilayer perceptron with 3 ReLU hidden layers was applied to a multi-class classification problem on the Epileptic Seizure Recognition (ESR) data set taken from [3]. Each 178-dimensional vector (out of 11500 samples) is the EEG recording of a given person for 1 second, categorized into one of 5 classes. To generate missing attributes, 25%, 50%, 75%, and 90% of observations were randomly removed. The aforementioned imputation methods were used in addition to Multiple Imputation by Chained Equation (mice) and a mixture of Gaussians (gmm). The former utilizes the conditional distribution of data by Markov chain Monte Carlo techniques to draw imputations. The latter replaces missing features with values sampled from GMM estimated from incomplete data using the EM algorithm.

Double 5-fold cross-validation was used to report classification results. The classical accuracy measure is usually being used for accessing the results.The model under study outperformed classical imputation methods, which give reasonable results only for a low number of missing values. The method under study performs nearly as well as CE, even though CE had access to complete training data.

'''Radial Basis Function Network'''

RBFN can be considered as a minimal architecture implementing our model, which contains only one hidden layer. A cross-entropy function was applied to a softmax in the output layer. Two-class data sets retrieved from the UCI repository [4] with internally missing attributes were used. Since the classification is binary, two additional SVM kernel models which work directly with incomplete data without performing any imputations were included in the experiment:

''geom:'' The objective function is based on the geometric interpretation of the margin and aims to maximize the margin of each sample in its own subspace [5].

''karma:'' This algorithm iteratively tunes kernel classifier under low-rank assumptions [6].

The above SVM methods were combined with RBF kernel function. The number of RBF units was selected in the inner cross-validation from the range {25, 50, 75, 100}. Initial centers of RBFNs were randomly selected from training data while variances were samples from <math>N(0,1)</math> distribution. For SVM methods, the margin parameter <math>C</math> and kernel radius <math>\gamma</math> were selected from <math>\{2^k :k=−5,−3,...,9\}</math> for both parameters. For karma, additional parameter <math>\gamma_{karma}</math> was selected from the set <math>\{1, 2\}</math>.

The model under study outperformed imputation techniques in almost all cases. It partially confirms that the use of raw incomplete data in neural networks is a usually better approach than filling missing attributes before the learning process. Moreover, it obtained more accurate results than modified kernel methods, which directly work on incomplete data. The performance of the model was once again comparable to, and in some cases better than CE, which had access to the complete data.

== Conclusion ==

The results with these experiments along with the theoretical results conclude that this novel approach for dealing with missing data through a modification of a neural network is beneficial and outperforms many existing methods. This approach, which utilizes representing missing data with a probability density function, allows a neural network to determine a more generalized and accurate response of the neuron.

== Critiques ==

- A simulation study where the mechanism of missingness is known will be interesting to examine. Doing this will allow us to see when the proposed method is better than existing methods, and under what conditions.

- This method of imputing incomplete data has many limitations: in most cases when we have a missing data point we are facing a relatively small amount of data that does not require training of a neural network. For a large dataset, missing records does not seem to be very crucial because obtaining data will be relatively easier, and using an empirical way of imputing data such as a majority vote will be sufficient.

- An interesting application of this problem is in NLP. In NLP, especially Question Answering, there is a problem where a query is given and an answer must be retrieved, but the knowledge base is incomplete. There is therefore a requirement for the model to be able to infer information from the existing knowledge base in order to answer the question. Although this problem is a little more contrived than the one mentioned here, it is nevertheless similar in nature because it requires the ability to probabilistically determine some value which can then be used as a response.

- The experiments in this paper evaluate this method against low amounts of missing data. It would be interesting to see the properties of this imputation when a majority of the data is missing, and see if this method can outperform dropout training in this setting (dropout is known to be surprisingly robust even at high drop levels).

- This problem can possibly be applied to face recognition where given a blurry image of a person's face, the neural network can make the image clearer such that the face of the person would be visible for humans to see and also possible for the software to identify who the person is.

- This novel approach can also be applied to restoring damaged handwritten historical documents. By feeding in a damaged document with portions of unreadable texts, the neural network can add missing information utilizing a trained context encoder

- It will be interesting to see how this method performs with audio data, i.e. say if there are parts of an audio file that are missing, whether the neural network will be able to learn the underlying distribution and impute the missing sections of speech.

- In general, data are usually missing in a specific part of the content. For example, old books usually have first couple page or last couple pages that are missing. It would be interesting to see that how the distribution of "missing data" will be applied in those cases.

- In this paper, the researchers were able to outperform existing imputation methods using neural networks. It would be really nice to see how does the usage of neural networks impact the need for amount of data, and how much more training is required in comparison to the other algorithms provided in this paper.

- It might be an interesting approach to investigate how the size of missing data may influence the training. For example, in the MNIST AutoEncoder, some algorithms involve masks to generate more general images and avoid overfitting. The approach could compare the result by changing the size of the "missing" part to illustrate to what degree can we ignore the missing data and view them as assistance.

- It would be nice to see how the method under study outperforms both the Multilayer Perceptron and Radial Basis Function Network methods mentioned in the summary. Since these two methods are placed under the Experimental Results section, it is expected that they consist more justifications and supporting evidence (analytical results, tables, graphs, etc.) then what have been presented, so that it is more convincing for the readers.

== References ==
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous
data. In Advances in neural information processing systems, pages 395–401, 1996.

[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

[3] Ralph G Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian E Elger. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E, 64(6):061907, 2001.

[4] Arthur Asuncion and David J. Newman. UCI Machine Learning Repository, 2007.

[5] Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, and Daphne Koller. Max-margin classification of data with absent features. Journal of Machine Learning Research, 9:1–21, 2008.

[6] Elad Hazan, Roi Livni, and Yishay Mansour. Classification with low rank and missing data. In Proceedings of The 32nd International Conference on Machine Learning, pages 257–266, 2015.

Influenza Forecasting Framework based on Gaussian Processes

2020-11-27T14:08:35Z

D5cui: /* Related Work */

== Abstract ==

This paper presents a novel framework for seasonal epidemic forecasting using Gaussian process regression. Compared with other state-of-the-art models, the new novel framework introduced in this paper shows accurate retrospective forecasts through training a subset of the CDC influenza-like-illness (ILI) dataset using the official CDC scoring rule (log-score).

== Background ==

Each year, the seasonal influenza epidemic affects public health systems across the globe on a massive scale. In 2019-2020 alone, the United States reported 38 million cases, 400 000 hospitalizations, and 22 000 deaths [1]. A direct result of this is the need for reliable forecasts of future influenza development because they allow for improved public health policies and informed resource development and allocation. Many statistical methods have been developed to use data from the CDC and other real-time data sources, such as Google Trends to forecast influenza activities.

Given the process of data collection and surveillance lag, accurate statistics for influenza warning systems are often delayed by some margin of time, making early prediction imperative. However, there are challenges in long-term epidemic forecasting. First, the temporal dependency is hard to capture with short-term input data. Without manually added seasonal trends, most statistical models fail to provide high accuracy. Second, the influence from other locations has not been exhaustively explored with limited data input. Spatio-temporal effects would therefore require adequate data sources to achieve good performance.

== Related Work ==

Given the value of epidemic forecasts, the CDC regularly publishes ILI data and has funded a seasonal ILI forecasting challenge. This challenge has led to four state-of-the-art models in the field: Historical averages, a method using the counts of past seasons to build a kernel density estimate; MSS, a physical susceptible-infected-recovered model with assumed linear noise [5]; SARIMA, a framework based on seasonal auto-regressive moving average models [2]; and LinEns, an ensemble of three linear regression models.

One disadvantage of forecasting is that the publication of CDC official ILI data is usually delayed by 1-3 weeks. Therefore, several attempts to use indirect web-based data to gain more timely estimates of CDC’s ILI, which is also known as nowcasting. The forecasting framework introduced in this paper can use those nowcasting approaches to receive a more timely data stream.

== Motivation ==

It has been shown that LinEns forecasts outperform the other frameworks on the ILI data-set. However, this framework assumes a deterministic relationship between the epidemic week and its case count, which does not reflect the stochastic nature of the trend. Therefore, it is natural to ask whether a similar framework that assumes a stochastic relationship between these variables would provide better performance. This motivated the development of the proposed Gaussian process regression framework and the subsequent performance comparison to the benchmark models.

== Gaussian Process Regression ==

A Gaussian process is specified with a mean function and a kernel function. Besides, the mean and the variance function of the Gaussian process is defined by:
\[\mu(x) = k(x, X)^T(K_n+\sigma^2I)^{-1}Y\]
\[\sigma(x) = k^{**}(x,x)-k(x,X)^T(K_n+\sigma^2I)^{-1}k(x,X)\]
where <math>K_n</math> is the covariance matrix and for the kernel, it is being specified that it will use the Gaussian kernel.

Consider the following set up: let <math>X = [\mathbf{x}_1,\ldots,\mathbf{x}_n]</math> <math>(d\times n)</math> be your training data, <math>\mathbf{y} = [y_1,y_2,\ldots,y_n]^T</math> be your noisy observations where <math>y_i = f(\mathbf{x}_i) + \epsilon_i</math>, <math>(\epsilon_i:i = 1,\ldots,n)</math> i.i.d. <math>\sim \mathcal{N}(0,{\sigma}^2)</math>, and <math>f</math> is the trend we are trying to model (by <math>\hat{f}</math>). Let <math>\mathbf{x}^*</math> <math>(d\times 1)</math> be your test data point, and <math>\hat{y} = \hat{f}(\mathbf{x}^*)</math> be your predicted outcome.

Instead of assuming a deterministic form of <math>f</math>, and thus of <math>\mathbf{y}</math> and <math>\hat{y}</math> (as classical linear regression would, for example), Gaussian process regression assumes <math>f</math> is stochastic. More precisely, <math>\mathbf{y}</math> and <math>\hat{y}</math> are assumed to have a joint prior distribution. Indeed, we have

$$
(\mathbf{y},\hat{y}) \sim \mathcal{N}(0,\Sigma(X,\mathbf{x}^*))
$$

where <math>\Sigma(X,\mathbf{x}^*)</math> is a matrix of covariances dependent on some kernel function <math>k</math>. In this paper, the kernel function is assumed to be Gaussian and takes the form

$$
k(\mathbf{x}_i,\mathbf{x}_j) = \sigma^2\exp(-\frac{1}{2}(\mathbf{x}_i-\mathbf{x}_j)^T\Sigma(\mathbf{x}_i-\mathbf{x}_j)).
$$

It is important to note that this gives a joint prior distribution of '''functions''' ('''Fig. 1''' left, grey curves).

By restricting this distribution to contain only those functions ('''Fig. 1''' right, grey curves) that agree with the observed data points <math>\mathbf{x}</math> ('''Fig. 1''' right, solid black) we obtain the posterior distribution for <math>\hat{y}</math> which has the form

$$
p(\hat{y} | \mathbf{x}^*, X, \mathbf{y}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}),\sigma(\mathbf{x}^*,X))
$$

<div style="text-align:center;"> [[File:GPRegression.png|500px]] </div>

<div align="center">'''Figure 1. Gaussian process regression''':
<div align="center">
Select the functions from your joint prior distribution (left, grey curves) with mean <math>0</math> (left, bold line) that agree with the observed data points (right, black bullets). </div>
<div align="center">These form your posterior distribution (right, grey curves) with mean <math>\mu(\mathbf{x})</math> (right, bold line). Red triangle helps compare the two images (location marker) [4]. </div>
</div>

== Data-set ==

Let <math>d_j^i</math> denote the number of epidemic cases recorded in week <math>j</math> of season <math>i</math>, and let <math>j^*</math> and <math>i^*</math> denote the current week and season, respectively. The ILI data-set contains <math>d_j^i</math> for all previous weeks and seasons, up to the current season with a 1-3 week publishing delay. Note that a season refers to the time of year when the epidemic is prevalent (e.g. an influenza season lasts 30 weeks and contains the last 10 weeks of year k, and the first 20 weeks of year k+1). The goal is to predict <math>\hat{y}_T = \hat{f}_T(x^*) = d^{i^*}_{j* + T}</math> where <math>T, \;(T = 1,\ldots,K)</math> is the target week (how many weeks into the future that you want to predict).

To do this, a design matrix <math>X</math> is constructed where each element <math>X_{ji} = d_j^i</math> corresponds to the number of cases in week (row) j of season (column) i. The training outcomes <math>y_{i,T}, i = 1,\ldots,n</math> correspond to the number of cases that were observed in target week <math>T,\; (T = 1,\ldots,K)</math> of season <math>i, (i = 1,\ldots,n)</math>.

== Proposed Framework ==

To compute <math>\hat{y}</math>, the following algorithm is executed:

<ol>

<li> Let <math>J \subseteq \{j^*-4 \leq j \leq j^*\}</math> (subset of possible weeks).

<li> Assemble the Training Set <math>\{X_J, \mathbf{y}_{T,J}\}</math> .

<li> Train the Gaussian process.

<li> Calculate the '''distribution''' of <math>\hat{y}_{T,J}</math> using <math>p(\hat{y}_{T,J} | \mathbf{x}^*, X_J, \mathbf{y}_{T,J}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}_{T,J}),\sigma(\mathbf{x}^*,X_J))</math>.

<li> Set <math>\hat{y}_{T,J} =\mu(x^*,X_J,\mathbf{y}_{T,J})</math>.

<li> Repeat steps 2-5 for all sets of weeks <math>J</math>.

<li> Determine the best 3 performing sets J (on the 2010/11 and 2011/12 validation sets).

<li> Calculate the ensemble forecast by averaging the 3 best performing predictive distribution densities i.e. <math>\hat{y}_T = \frac{1}{3}\sum_{k=1}^3 \hat{y}_{T,J_{best}}</math>.

</ol>

== Results ==

To demonstrate the accuracy of their results, retrospective forecasting was done on the ILI data-set. Retrospective forecasting involves using information from a specific time period in the past for example a certain week of the previous year, and tries to forecast future values for example the next week's cases. Then, the forecasting accuracy can be evaluated on the actual observed value. In other words, the Gaussian process model was trained to assume a previous season (2012/13) was the current season. In this fashion, the forecast could be compared to the already observed true outcome.

To produce a forecast for the entire 2012/13 season, 30 Gaussian processes were trained (each influenza season has 30 test points <math>\mathbf{x^*}</math>) and a curve connecting the predicted outputs <math>y_T = \hat{f}(\mathbf{x^*)}</math> was plotted ('''Fig.2''', blue line). As shown in '''Fig.2''', this forecast (blue line) was reliable for both 1 (left) and 3 (right) week targets, given that the 95% prediction interval ('''Fig.2''', purple shaded) contained the true values ('''Fig.2''', red x's) 95% of the time.

<div style="text-align:center;"> [[File:ResultsOne.png|600px]] </div>

<div align="center">'''Figure 2. Retrospective forecasts and their uncertainty''':
<div align="center">One week retrospective influenza forecasting for two targets (T = 1, 3). Red x’s are the true observed values, and blue lines and purple shaded areas represent point forecasts and 95% prediction intervals, respectively. </div>
</div>

Moreover, as shown in '''Fig.3''', the novel Gaussian process regression framework outperformed all state-of-the-art models, included LinEns, for four different targets <math>(T = 1,\ldots, 4)</math>, when compared using the official CDC scoring criterion ''log-score''. Log-score describes the logarithmic probability of the forecast being within an interval around the true value.

<div style="text-align:center;"> [[File:ComparisonNew.png|600px]] </div>

<div align="center">'''Figure 3. Average log-score gain of proposed framework''':
<div align="center"> Each bar shows the mean seasonal log-score gain of the proposed framework vs. the given state-of-the-art model, and each panel corresponds to a different target week <math> T = 1,...4 </math>. </div>
</div>

== Conclusion ==

This paper presented a novel framework for forecasting seasonal epidemics using Gaussian process regression that outperformed multiple state-of-the-art forecasting methods on the CDC's ILI data-set. Since influenza data has only been collected from 2003, therefore as more data is obtained: the training models can be further improved upon with more robust model optimizations. Hence, this work may play a key role in future influenza forecasting and as result, the improvement of public health policies and resource allocation.

== Critique ==

The proposed framework provides a computationally efficient method to forecast any seasonal epidemic count data that is easily extendable to multiple target types. In particular, one can compute key parameters such as the peak infection incidence <math>\left(\hat{y} = \text{max}_{0 \leq j \leq 52} d^i_j \right)</math>, the timing of the peak infection incidence <math>\left(\hat{y} = \text{argmax}_{0 \leq j \leq 52} d^i_j\right)</math> and the final epidemic size of a season <math>\left(\hat{y} = \sum_{j=1}^{52} d^i_j\right)</math>. However, given it is not a physical model, it cannot provide insights on parameters describing the disease spread. Moreover, the framework requires training data and hence, is not applicable for non-seasonal epidemics.

This paper provides a state of the art approach for forecasting epidemics. It would have been interesting to see other types of kernels being used, such as a periodic kernel <math>k(x, x') = \sigma^2 \exp\left({-\frac{2 \sin^2 (\pi|x-x'|)/p}{l^2}}\right) </math>, as intuitively epidemics are known to have waves within seasons. This may have resulted in better-calibrated uncertainty estimates as well.

It is mentioned that the the framework might not be good for non-seasonal epidemics because it requires training data, given that the COVID-19 pandemic comes in multiple waves and we have enough data from the first wave and second wave, we might be able to use this framework to predict the third wave and possibly the fourth one as well. It'd be interesting to see this forecasting framework being trained using the data from the first and second wave of COVID-19.

Gaussian Process Regression (GPR) can be extended to COVID-19 outbreak forecast [3]. A Multi-Task Gaussian Process Regression (MTGP) model is a special case of GPR that yields multiple outputs. It was first proposed back in 2008, and its recent application was in the battery capacity prediction by Richardson et. al. The proposed MTGP model was applied on Worldwide, China, India, Italy, and USA data that covers COVID-19 statistics from Dec 31 2019 to June 25 2020. Results show that MTGP outperforms linear regression, support vector regression, random forest regression, and long short-term memory model when it comes to forecasting accuracy.

A critique for GPR is that it does not take into account the physicality of the epidemic process. So it would be interesting to investigate if we can incorporate elements of physical processes in this such as using Ordinary Differential Equations (ODE) based modeling and seeing if that improves prediction accuracy or helps us explain phenomena.

This is a very nice summary and everything was discussed clearly. I think it will be interesting that apply GPR by using the data in COVID-19, maybe this will be helpful to give a sense that how likely that 3rd wave of COVID-19 will happen. Maybe it will be better to discuss more that a training data set from a location will not train a good model for other countries. (p.s. I think <math> \sigma </math> is the standard deviation, sometimes saying variance function with the notation <math> \sigma </math> confused me.).

== References ==

[1] Estimated Influenza Illnesses, Medical visits, Hospitalizations, and Deaths in the United States - 2019–2020 Influenza Season. (2020). Retrieved November 16, 2020, from https://www.cdc.gov/flu/about/burden/2019-2020.html

[2] Ray, E. L., Sakrejda, K., Lauer, S. A., Johansson, M. A.,and Reich, N. G. (2017).Infectious disease prediction with kernel conditional densityestimation.Statistics in Medicine, 36(30):4908–4929.

[3] Ketu, S., Mishra, P.K. Enhanced Gaussian process regression-based forecasting model for COVID-19 outbreak and significance of IoT for its detection. Appl Intell (2020). https://doi.org/10.1007/s10489-020-01889-9

[4] Schulz, E., Speekenbrink, M., and Krause, A. (2017).A tutorial on gaussian process regression with a focus onexploration-exploitation scenarios.bioRxiv.

[5] Zimmer, C., Leuba, S. I., Cohen, T., and Yaesoubi, R.(2019).Accurate quantification of uncertainty in epidemicparameter estimates and predictions using stochasticcompartmental models.Statistical Methods in Medical Research,28(12):3591–3608.PMID: 30428780.

Influenza Forecasting Framework based on Gaussian Processes

2020-11-27T14:07:05Z

D5cui: /* Background */

== Abstract ==

This paper presents a novel framework for seasonal epidemic forecasting using Gaussian process regression. Compared with other state-of-the-art models, the new novel framework introduced in this paper shows accurate retrospective forecasts through training a subset of the CDC influenza-like-illness (ILI) dataset using the official CDC scoring rule (log-score).

== Background ==

Each year, the seasonal influenza epidemic affects public health systems across the globe on a massive scale. In 2019-2020 alone, the United States reported 38 million cases, 400 000 hospitalizations, and 22 000 deaths [1]. A direct result of this is the need for reliable forecasts of future influenza development because they allow for improved public health policies and informed resource development and allocation. Many statistical methods have been developed to use data from the CDC and other real-time data sources, such as Google Trends to forecast influenza activities.

Given the process of data collection and surveillance lag, accurate statistics for influenza warning systems are often delayed by some margin of time, making early prediction imperative. However, there are challenges in long-term epidemic forecasting. First, the temporal dependency is hard to capture with short-term input data. Without manually added seasonal trends, most statistical models fail to provide high accuracy. Second, the influence from other locations has not been exhaustively explored with limited data input. Spatio-temporal effects would therefore require adequate data sources to achieve good performance.

== Related Work ==

Given the value of epidemic forecasts, the CDC regularly publishes ILI data and has funded a seasonal ILI forecasting challenge. This challenge has led to four state-of-the-art models in the field: Historical averages, a method using the counts of past seasons to build a kernel density estimate; MSS, a physical susceptible-infected-recovered model with assumed linear noise [5]; SARIMA, a framework based on seasonal auto-regressive moving average models [2]; and LinEns, an ensemble of three linear regression models.

One disadvantage of forecasting is that the publication of CDC official ILI data is usually delayed by 1-3 weeks. Therefore, several attempts use indirect web-based data to gain more timely estimates of CDC’s ILI, which is also known as nowcasting. The forecasting framework introduced in this paper is able to use those nowcasting approaches to receive a more timely data stream.

== Motivation ==

It has been shown that LinEns forecasts outperform the other frameworks on the ILI data-set. However, this framework assumes a deterministic relationship between the epidemic week and its case count, which does not reflect the stochastic nature of the trend. Therefore, it is natural to ask whether a similar framework that assumes a stochastic relationship between these variables would provide better performance. This motivated the development of the proposed Gaussian process regression framework and the subsequent performance comparison to the benchmark models.

== Gaussian Process Regression ==

A Gaussian process is specified with a mean function and a kernel function. Besides, the mean and the variance function of the Gaussian process is defined by:
\[\mu(x) = k(x, X)^T(K_n+\sigma^2I)^{-1}Y\]
\[\sigma(x) = k^{**}(x,x)-k(x,X)^T(K_n+\sigma^2I)^{-1}k(x,X)\]
where <math>K_n</math> is the covariance matrix and for the kernel, it is being specified that it will use the Gaussian kernel.

Consider the following set up: let <math>X = [\mathbf{x}_1,\ldots,\mathbf{x}_n]</math> <math>(d\times n)</math> be your training data, <math>\mathbf{y} = [y_1,y_2,\ldots,y_n]^T</math> be your noisy observations where <math>y_i = f(\mathbf{x}_i) + \epsilon_i</math>, <math>(\epsilon_i:i = 1,\ldots,n)</math> i.i.d. <math>\sim \mathcal{N}(0,{\sigma}^2)</math>, and <math>f</math> is the trend we are trying to model (by <math>\hat{f}</math>). Let <math>\mathbf{x}^*</math> <math>(d\times 1)</math> be your test data point, and <math>\hat{y} = \hat{f}(\mathbf{x}^*)</math> be your predicted outcome.

Instead of assuming a deterministic form of <math>f</math>, and thus of <math>\mathbf{y}</math> and <math>\hat{y}</math> (as classical linear regression would, for example), Gaussian process regression assumes <math>f</math> is stochastic. More precisely, <math>\mathbf{y}</math> and <math>\hat{y}</math> are assumed to have a joint prior distribution. Indeed, we have

$$
(\mathbf{y},\hat{y}) \sim \mathcal{N}(0,\Sigma(X,\mathbf{x}^*))
$$

where <math>\Sigma(X,\mathbf{x}^*)</math> is a matrix of covariances dependent on some kernel function <math>k</math>. In this paper, the kernel function is assumed to be Gaussian and takes the form

$$
k(\mathbf{x}_i,\mathbf{x}_j) = \sigma^2\exp(-\frac{1}{2}(\mathbf{x}_i-\mathbf{x}_j)^T\Sigma(\mathbf{x}_i-\mathbf{x}_j)).
$$

It is important to note that this gives a joint prior distribution of '''functions''' ('''Fig. 1''' left, grey curves).

By restricting this distribution to contain only those functions ('''Fig. 1''' right, grey curves) that agree with the observed data points <math>\mathbf{x}</math> ('''Fig. 1''' right, solid black) we obtain the posterior distribution for <math>\hat{y}</math> which has the form

$$
p(\hat{y} | \mathbf{x}^*, X, \mathbf{y}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}),\sigma(\mathbf{x}^*,X))
$$

<div style="text-align:center;"> [[File:GPRegression.png|500px]] </div>

<div align="center">'''Figure 1. Gaussian process regression''':
<div align="center">
Select the functions from your joint prior distribution (left, grey curves) with mean <math>0</math> (left, bold line) that agree with the observed data points (right, black bullets). </div>
<div align="center">These form your posterior distribution (right, grey curves) with mean <math>\mu(\mathbf{x})</math> (right, bold line). Red triangle helps compare the two images (location marker) [4]. </div>
</div>

== Data-set ==

Let <math>d_j^i</math> denote the number of epidemic cases recorded in week <math>j</math> of season <math>i</math>, and let <math>j^*</math> and <math>i^*</math> denote the current week and season, respectively. The ILI data-set contains <math>d_j^i</math> for all previous weeks and seasons, up to the current season with a 1-3 week publishing delay. Note that a season refers to the time of year when the epidemic is prevalent (e.g. an influenza season lasts 30 weeks and contains the last 10 weeks of year k, and the first 20 weeks of year k+1). The goal is to predict <math>\hat{y}_T = \hat{f}_T(x^*) = d^{i^*}_{j* + T}</math> where <math>T, \;(T = 1,\ldots,K)</math> is the target week (how many weeks into the future that you want to predict).

To do this, a design matrix <math>X</math> is constructed where each element <math>X_{ji} = d_j^i</math> corresponds to the number of cases in week (row) j of season (column) i. The training outcomes <math>y_{i,T}, i = 1,\ldots,n</math> correspond to the number of cases that were observed in target week <math>T,\; (T = 1,\ldots,K)</math> of season <math>i, (i = 1,\ldots,n)</math>.

== Proposed Framework ==

To compute <math>\hat{y}</math>, the following algorithm is executed:

<ol>

<li> Let <math>J \subseteq \{j^*-4 \leq j \leq j^*\}</math> (subset of possible weeks).

<li> Assemble the Training Set <math>\{X_J, \mathbf{y}_{T,J}\}</math> .

<li> Train the Gaussian process.

<li> Calculate the '''distribution''' of <math>\hat{y}_{T,J}</math> using <math>p(\hat{y}_{T,J} | \mathbf{x}^*, X_J, \mathbf{y}_{T,J}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}_{T,J}),\sigma(\mathbf{x}^*,X_J))</math>.

<li> Set <math>\hat{y}_{T,J} =\mu(x^*,X_J,\mathbf{y}_{T,J})</math>.

<li> Repeat steps 2-5 for all sets of weeks <math>J</math>.

<li> Determine the best 3 performing sets J (on the 2010/11 and 2011/12 validation sets).

<li> Calculate the ensemble forecast by averaging the 3 best performing predictive distribution densities i.e. <math>\hat{y}_T = \frac{1}{3}\sum_{k=1}^3 \hat{y}_{T,J_{best}}</math>.

</ol>

== Results ==

To demonstrate the accuracy of their results, retrospective forecasting was done on the ILI data-set. Retrospective forecasting involves using information from a specific time period in the past for example a certain week of the previous year, and tries to forecast future values for example the next week's cases. Then, the forecasting accuracy can be evaluated on the actual observed value. In other words, the Gaussian process model was trained to assume a previous season (2012/13) was the current season. In this fashion, the forecast could be compared to the already observed true outcome.

To produce a forecast for the entire 2012/13 season, 30 Gaussian processes were trained (each influenza season has 30 test points <math>\mathbf{x^*}</math>) and a curve connecting the predicted outputs <math>y_T = \hat{f}(\mathbf{x^*)}</math> was plotted ('''Fig.2''', blue line). As shown in '''Fig.2''', this forecast (blue line) was reliable for both 1 (left) and 3 (right) week targets, given that the 95% prediction interval ('''Fig.2''', purple shaded) contained the true values ('''Fig.2''', red x's) 95% of the time.

<div style="text-align:center;"> [[File:ResultsOne.png|600px]] </div>

<div align="center">'''Figure 2. Retrospective forecasts and their uncertainty''':
<div align="center">One week retrospective influenza forecasting for two targets (T = 1, 3). Red x’s are the true observed values, and blue lines and purple shaded areas represent point forecasts and 95% prediction intervals, respectively. </div>
</div>

Moreover, as shown in '''Fig.3''', the novel Gaussian process regression framework outperformed all state-of-the-art models, included LinEns, for four different targets <math>(T = 1,\ldots, 4)</math>, when compared using the official CDC scoring criterion ''log-score''. Log-score describes the logarithmic probability of the forecast being within an interval around the true value.

<div style="text-align:center;"> [[File:ComparisonNew.png|600px]] </div>

<div align="center">'''Figure 3. Average log-score gain of proposed framework''':
<div align="center"> Each bar shows the mean seasonal log-score gain of the proposed framework vs. the given state-of-the-art model, and each panel corresponds to a different target week <math> T = 1,...4 </math>. </div>
</div>

== Conclusion ==

This paper presented a novel framework for forecasting seasonal epidemics using Gaussian process regression that outperformed multiple state-of-the-art forecasting methods on the CDC's ILI data-set. Since influenza data has only been collected from 2003, therefore as more data is obtained: the training models can be further improved upon with more robust model optimizations. Hence, this work may play a key role in future influenza forecasting and as result, the improvement of public health policies and resource allocation.

== Critique ==

The proposed framework provides a computationally efficient method to forecast any seasonal epidemic count data that is easily extendable to multiple target types. In particular, one can compute key parameters such as the peak infection incidence <math>\left(\hat{y} = \text{max}_{0 \leq j \leq 52} d^i_j \right)</math>, the timing of the peak infection incidence <math>\left(\hat{y} = \text{argmax}_{0 \leq j \leq 52} d^i_j\right)</math> and the final epidemic size of a season <math>\left(\hat{y} = \sum_{j=1}^{52} d^i_j\right)</math>. However, given it is not a physical model, it cannot provide insights on parameters describing the disease spread. Moreover, the framework requires training data and hence, is not applicable for non-seasonal epidemics.

This paper provides a state of the art approach for forecasting epidemics. It would have been interesting to see other types of kernels being used, such as a periodic kernel <math>k(x, x') = \sigma^2 \exp\left({-\frac{2 \sin^2 (\pi|x-x'|)/p}{l^2}}\right) </math>, as intuitively epidemics are known to have waves within seasons. This may have resulted in better-calibrated uncertainty estimates as well.

It is mentioned that the the framework might not be good for non-seasonal epidemics because it requires training data, given that the COVID-19 pandemic comes in multiple waves and we have enough data from the first wave and second wave, we might be able to use this framework to predict the third wave and possibly the fourth one as well. It'd be interesting to see this forecasting framework being trained using the data from the first and second wave of COVID-19.

Gaussian Process Regression (GPR) can be extended to COVID-19 outbreak forecast [3]. A Multi-Task Gaussian Process Regression (MTGP) model is a special case of GPR that yields multiple outputs. It was first proposed back in 2008, and its recent application was in the battery capacity prediction by Richardson et. al. The proposed MTGP model was applied on Worldwide, China, India, Italy, and USA data that covers COVID-19 statistics from Dec 31 2019 to June 25 2020. Results show that MTGP outperforms linear regression, support vector regression, random forest regression, and long short-term memory model when it comes to forecasting accuracy.

A critique for GPR is that it does not take into account the physicality of the epidemic process. So it would be interesting to investigate if we can incorporate elements of physical processes in this such as using Ordinary Differential Equations (ODE) based modeling and seeing if that improves prediction accuracy or helps us explain phenomena.

This is a very nice summary and everything was discussed clearly. I think it will be interesting that apply GPR by using the data in COVID-19, maybe this will be helpful to give a sense that how likely that 3rd wave of COVID-19 will happen. Maybe it will be better to discuss more that a training data set from a location will not train a good model for other countries. (p.s. I think <math> \sigma </math> is the standard deviation, sometimes saying variance function with the notation <math> \sigma </math> confused me.).

== References ==

[1] Estimated Influenza Illnesses, Medical visits, Hospitalizations, and Deaths in the United States - 2019–2020 Influenza Season. (2020). Retrieved November 16, 2020, from https://www.cdc.gov/flu/about/burden/2019-2020.html

[2] Ray, E. L., Sakrejda, K., Lauer, S. A., Johansson, M. A.,and Reich, N. G. (2017).Infectious disease prediction with kernel conditional densityestimation.Statistics in Medicine, 36(30):4908–4929.

[3] Ketu, S., Mishra, P.K. Enhanced Gaussian process regression-based forecasting model for COVID-19 outbreak and significance of IoT for its detection. Appl Intell (2020). https://doi.org/10.1007/s10489-020-01889-9

[4] Schulz, E., Speekenbrink, M., and Krause, A. (2017).A tutorial on gaussian process regression with a focus onexploration-exploitation scenarios.bioRxiv.

[5] Zimmer, C., Leuba, S. I., Cohen, T., and Yaesoubi, R.(2019).Accurate quantification of uncertainty in epidemicparameter estimates and predictions using stochasticcompartmental models.Statistical Methods in Medical Research,28(12):3591–3608.PMID: 30428780.

User:Bsharman

2020-11-27T13:53:00Z

D5cui:

'''Risk prediction in life insurance industry using supervised learning algorithms'''

'''Presented By'''

Bharat Sharman, Dylan Li, Leonie Lu, Mingdao Li

==Introduction==

Risk assessment lies at the core of the life insurance industry. It is extremely important for companies to assess the risk of an application accurately to ensure that those submitted by actual low risk applicants are accepted, whereas applications submitted by high risk applicants are either rejected or placed aside for further review. The types of risks include but are not limited to, a person's smoking status, and family history (such as history of heart disease). If this is not the case, individuals with an extremely high risk profile might be issued policies, often leading to high insurance payouts and large losses for the company. Such a situation is called ‘Adverse Selection’, where individuals who are most likely to be issued a payout are in fact given a policy and those who are not as likely are not, thus causing the company to suffer losses as a result.

Traditionally, the process of underwriting (deciding whether or not to insure the life of an individual) has been done using actuarial calculations and judgements from underwriters. Actuaries group customers according to their estimated levels of risk determined from historical data. (Cummins J, 2013) However, these conventional techniques are time-consuming and it is not uncommon to take a month to issue a policy. They are expensive as a lot of manual processes need to be executed and a lot of data needs to be imported for the purpose of calculation.

Predictive analysis is an effective technique that simplifies the underwriting process to reduce policy issuance time and improve the accuracy of risk prediction. In life insurance, this approach is widely used in mortality rates modeling to help underwriters make decisions and improve the profitability of the business. In this paper, the authors used data from Prudential Life Insurance company and investigated the most appropriate data extraction method and the most appropriate algorithm to assess risk.

==Literature Review==

Before a life insurance company issues a policy, it must execute a series of underwriting related tasks (Mishr, 2016). These tasks involve gathering extensive information about the applicants. The insurer has to analyze the employment, medical, family and insurance histories of the applicants and factor all of them into a complicated series of calculations to determine the risk rating of the applicants. On basis of this risk rating, premiums are calculated (Prince, 2016).

In a competitive marketplace, customers need policies to be issued quickly and long wait times can lead to them switch to other providers (Chen 2016). In addition, the costs of data gathering and analysis can be expensive. The insurance company bears the expenses of the medical examinations and if a policy lapses, then the insurer has to bear the losses of all these costs (J Carson, 2017). If the underwriting process uses predictive analytics, then the costs and time associated with many of these processes will be reduced significantly.

The key importance of a strong underwriting process for a life insurer is to avoid the risk of adverse selection. Adverse selection occurs in situations when an applicant has more knowledge about their health or conditions than the insurance company, and insurers can incur significant losses if they are systematically issuing policies to high-risk applicants due to this asymmetry of information. In order to avoid adverse selection, a strong classification system that correctly groups applicants into their appropriate risk levels is needed, and this is the motivation for this research.

==Methods and Techniques==

In Figure 1, the process flow of the analytics approach has been depicted. These stages will now be described in the following sections.

[[File:Data_Analytics_Process_Flow.PNG]]

'''Description of the Dataset'''

The data is obtained from the Kaggle competition hosted by the Prudential Life Insurance company. It has 59381 applications with 128 attributes. The attributes are continuous and discrete as well as categorical variables.
The data attributes, their types and the description is shown in Table 1 below:

[[File:Data Attributes Types and Description.png]]

'''Data Pre-Processing'''

In the data preprocessing step, missing values in the data are either imputed or dropped and some of the attributes are either transformed to make the subsequent processing of data easier. This decision is made after determining the mechanism of missingness, that is if the data is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR).

'''Data Exploration using Visual Analytics'''

The Exploratory Data Analysis (EDA) is composed by uni-variate and bivariate analyses, which allows the researchers to understand different distributions' features. Visual analytics aims to gain insights of data structures by creating charts and graphs for the prediction models.

'''Dimensionality Reduction'''

In this paper, there are two methods that have been used for dimensionality reduction –

1.Correlation based Feature Selection (CFS): This is a feature selection method in which a subset of features from the original features is selected. In this method, the algorithm selects features from the dataset that are highly correlated with the output but are not correlated with each other. The user does not need to specify the number of features to be selected. The correlation values are calculated based on measures such as Pearson’s coefficient, minimum description length, symmetrical uncertainty, and relief.

2.Principal Components Analysis (PCA): PCA is a feature extraction method that transforms existing features into new sets of features such that the correlation between them is zero and these transformed features explain the maximum variability in the data.

PCA creates new features based on existing ones, while CFS only selects the best attributes based on the predictive power. Although PCA performs some feature engineering on attributes in data sets, the resulting new features are more complex to explain because it is difficult to derive meanings from principal components. On the other hand, CFS is easier to understand and interpret because the original features have not been merged or modified.

==Supervised Learning Algorithms==

The four algorithms that have been used in this paper are the following:

1.Multiple Linear Regression: In MLR, the relationship between the dependent and the two or more independent variables is predicted by fitting a linear model. The model parameters are calculated by minimizing the sum of squares of the errors, which captures the average distance between the predicted data points and observed data. The significance of the variables is determined by tests like the F test and the p-values.

2.REPTree: REPTree stands for reduced error pruning tree. It can build both classification and regression trees, depending on the type of the response variable. In this case, it uses regression tree logic and creates many trees across several iterations. This algorithm develops these trees based on the principles of information gain and variance reduction. At the time of pruning the tree, the algorithm uses the lowest mean square error to select the best tree.

3.Random Tree (Also known as the Random Forest): A random tree selects some of the attributes at each node in the decision tree and builds a tree based on random selection of data as well as attributes. Random Tree does not do pruning. Instead, it estimates class probabilities based on a hold-out set.

4.Artificial Neural Network: In a neural network, the inputs are transformed into outputs via a series of layered units where each of these units transforms the input received via a function into an output that gets further transmitted to units down the line. The weights that are used to weigh the inputs are improved after each iteration via a method called backpropagation in which errors propagate backward in the network and are used to update the weights to make the computed output closer to the actual output.

==Experiments and Results==

'''Missing Data Mechanism'''

Attributes, where more than 30% of Data was missing, were dropped from the analysis. The data were tested for Missing Completely at Random (MCAR), one form of the nature of missing values using the Little Test. The null hypothesis that the missing data was completely random had a p-value of 0 meaning, MCAR was rejected. Then, all the variables were plotted to check how many missing values that they had, and the results are shown in the figure below:

[[File: Missing Value Plot of Training Data.png]]

The variables that have the most number of missing variables are plotted at the top and that have the least number of missing variables are plotted at the bottom of the y-axis in the figure above. Missing variables do not seem to have obvious patterns and therefore they are assumed to be Missing at Random (MAR), meaning the tendency for the variables to be missing is not related to the missing data but to the observed data.

'''Missing Data Imputation'''

Assuming that missing data follows a MAR pattern, multiple imputations are used as a technique to fill in the values of missing data. The steps involved in multiple imputation are the following:

Imputation: The imputation of the missing values is done over several steps and this results in a number of complete data sets. Imputation is usually done via a predictive model like linear regression to predict these missing values based on other variables in the data set.

Analysis: The complete data sets that are formed are analyzed and parameter estimates and standard errors are calculated.

Pooling: The analysis results are then integrated to form a final data set that is then used for further analysis.

'''Comparison of Feature Selection and Feature Extraction'''

The Correlation-based Feature Selection (CFS) method was performed using the Waikato Environment for Knowledge Analysis. It was implemented using a Best-first search method on a CfsSubsetEval attribute evaluator. 33 variables were selected out of the total of 117 features.
PCA was implemented via a Ranker Search Method using a Principal Components Attributes Evaluator. Out of the 117 features, those that had a standard deviation of more than 0.5 times the standard deviation of the first principal component were selected and this resulted in 20 features for further analysis.
After dimensionality reduction, this reduced data set was exported and used for building prediction models using the four machine learning algorithms discussed before – REPTree, Multiple Linear Regression, Random Tree, and ANNs. The results are shown in the table below:

[[File: Comparison of Results between CFS and PCA.png]]

For CFS, the REPTree model had the lowest MAE and RMSE. For PCA, the Multiple Linear Regression Model had the lowest MAE as well as RMSE. So, for this dataset, it seems that overall, Multiple Linear Regression and REPTree Models are the two best ones with lowest error rates. In terms of dimensionality reduction, it seems that CFS is a better method than PCA for this data set as the MAE and RMSE values are lower for all ML methods except ANNs.

==Conclusion and Further Work==

Predictive Analytics in the Life Insurance Industry is enabling faster customer service and lower costs by helping automate the process of Underwriting, thereby increasing satisfaction and loyalty.
In this study, the authors analyzed data obtained from Prudential Life Insurance to predict risk scores via Supervised Machine Learning Algorithms. The data was first pre-processed to first replace the missing values. Attributes having more than 30% of missing data were eliminated from the analysis.
Two methods of dimensionality reduction – CFS and PCA were used and the number of attributes used for further analysis was reduced to 33 and 20 via these two methods. The Machine Learning Algorithms that were implemented were – REPTree, Random Tree, Multiple Linear Regression, and Artificial Neural Networks. Model validation was performed via ten-fold cross-validation. The performance of the models was evaluated using MAE and RMSE measures.
Using the PCA method, Multiple Linear Regression showed the best results with MAE and RMSE values of 1.64 and 2.06 respectively. With CFS, REPTree had the highest accuracy with MAE and RMSE values of 1.52 and 2.02 respectively.
Further work can be directed towards dealing with all the variables rather than deleting the ones where more than 30% of the values are missing. Customer segmentation, i.e. grouping customers based on their profiles can help companies come up with a customized policy for each group. This can be done via unsupervised algorithms like clustering. Work can also be done to make the models more explainable especially if we are using PCA and ANNs to analyze data. We can also get indirect data about the prospective applicant like their driving behavior, education record, etc to see if these attributes contribute to better risk profiling than the already available data.

==Critiques==

The project built multiple models and had utilized various methods to evaluate the result. They could potentially ensemble the prediction, such as averaging the result of the different models, to achieve a better accuracy result. Another method is model stacking, we can input the result of one model as input into another model for better results. However, they do have some major setbacks: sometimes, the result could be effect negatively (ie: increase the RMSE). In addition, if the improvement is not prominent, it would make the process much more complex thus cost time and effort. In a research setting, stacking and ensembling are definitely worth a try. In a real-life business case, it is more of a trade-off between accuracy and effort/cost.

In this application, it is not essentially the same thing as classify risky as non-risky or vice versa, if the model misclassified a risky data as non-risky, it may create a large loss to the insurance companies. It is recommended that this issue could be carefully taken care of during the model selection.

This project only used unsupervised dimensionality reduction techniques to analyze the data and did not explore supervised methods such as linear discriminant analysis, quadratic discriminant analysis, Fisher discriminant analysis, or supervised PCA. The supervisory signal could've been the normalized risk score, which determines whether someone is granted the policy. The supervision might have provided insight on factors contributing to risk that would not have been captured with unsupervised methods.

In addition, the dimensionality reduction techniques seem to contradict each other. If we do select features based on their correlation, there is no need to do further PCA on the dataset. Even if we perform PCA, the resulting columns would have a similar number of columns compared to the original ones and it would lose interpretability of the features, which is crucial in real-life practices.

Some of the models mentioned in this article require a large amount of data to perform well (e.g. neural network and tree-based algorithm). In a practical sense, a direct insurer may not have enough data (for certain lines of businesses) to train these models, overfitting could be a serious problem. The computational cost could be another critical issue. On a small dataset, Bayesian methods can be used to incorporate prior information into statistical models to obtain more refined inference.

Attributes that have more than 30% missing data were dropped. The disadvantage of this is that the dropped variable might affect other variables in the dataset. An alternative approach to dealing with missing data would be the use of Neural Networks, eliminating the need for deletion or imputation depending on the mechanism missingness (Smieja et al 2018).

The project contains multiple models and requires a lot of data to execute the algorithm well, but the data set is entirely from an insurance company. These data will have incompleteness and one-sided factors leading to inaccurate results.
In the summary part, no specific conclusions and future work are mentioned, and the structure is not obvious and requires specific explanation.

The project provides an idea for the insurance industry to analyze risks. For the model selection part, I am thinking that it might be an idea to involve more complicated models to detect the baseline for the datasets. Besides, applying more complicated models might work as a justification for the feature selection correctness.

For sections titled "Missing Data Imputation" and "Missing Data Mechanism", it would be highly significant for the authors to provide more details regarding the experimental results. For instance, the authors mentioned that the null hypothesis which the missing data was completely random had a p-value of 0 meaning, MCAR was rejected. However, no relevant calculation or explanation has been provided to justify such conclusion.

==References==

Chen, T. (2016). Corporate reputation and financial performance of Life Insurers. Geneva Papers Risk Insur Issues Pract, 378-397.

Cummins J, S. B. (2013). Risk classification in Life Insurance. Springer 1st Edition.

J Carson, C. E. (2017). Sunk costs and screening: two-part tariffs in life insurance. SSRN Electron J, 1-26.

Jayabalan, N. B. (2018). Risk prediction in life insurance industry using supervised learning algorithms. Complex & Intelligent Systems, 145-154.

Mishr, K. (2016). Fundamentals of life insurance theories and applications. PHI Learning Pvt Ltd.

Prince, A. (2016). Tantamount to fraud? Exploring non-disclosure of genetic information in life insurance applications as grounds for policy recession. Health Matrix, 255-307.

Śmieja, M., Struski, Ł., Tabor, J., Zieliński, B., & Spurek, P. (2018). Processing of missing data by neural networks. In Advances in Neural Information Processing Systems (pp. 2719-2729).

Describtion of Text Mining

2020-11-19T16:52:46Z

D5cui: /* References */

== Presented by ==
Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid

== Introduction ==
This paper focuses on the different text mining tasks and the existence of text mining in healthcare and biomedical domains. The text mining field has been popular as a result of the amount of text data that is available in different forms. The text data is bound to grow even more in 2020, indicating a 50 times growth since 2010. To further explore the text mining field, the related text mining approaches can be considered. The different text mining approaches relate to two main methods: knowledge delivery and traditional data mining methods. The authors note that knowledge delivery methods involve the application of different steps to a specific data set to create specific patterns. Research in knowledge delivery methods has evolved over the years due to advances in hardware and software technology. On the other hand, data mining has experienced substantial development through the intersection of three fields: databases, machine learning, and statistics. As brought out by the authors, text mining approaches focus on the exploration of information from a specific text. The information explored is in the form of structured, semi-structured, and unstructured text. It is important to note that text mining covers different sets of algorithms and topics that include information retrieval. The topics and algorithms are used for analyzing different text forms.

== Text Representation and Encoding ==
In this section of the paper, the authors explore the different ways in which the text can be represented on a large collection of documents. One common way of representing the documents is in the form of a bag of words. The bag of words considers the occurrences of different terms. In different text mining applications, documents are ranked and represented as vectors so as to display the significance of any word. The authors note that the three basic models used are vector space, inference network, and the probabilistic models. The vector space model is used to represent documents by converting them into vectors. In the model, a variable is used to represent each model to indicate the importance of the word in the document. The words are weighted using the TF-IDF scheme computed as
<math>q(w)=f_d(w)*log{\frac{|D|}{f_D(w)}}</math>.
In many text mining algorithms, one of the key components is preprocessing. Preprocessing consists of different tasks that include filtering, tokenization, stemming, and lemmatization. The first step is tokenization, where a character sequence is broken down into different words or phrases. After the breakdown, filtering is carried out to remove some words. The various word inflected forms are grouped together through lemmatization, and later, the derived roots of the derived words are obtained through stemming.

== Classification ==

== Clustering ==

== Clustering ==

== Information Extraction ==

== Biomedical Application ==

== Discussion ==

== References ==

Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering, and extraction techniques. arXiv preprint arXiv:1707.02919.

Describtion of Text Mining

2020-11-19T16:49:14Z

D5cui: /* Text Representation and Encoding */

Describtion of Text Mining

2020-11-19T16:35:46Z

D5cui: /* Introduction */

Describtion of Text Mining

2020-11-15T17:34:47Z

D5cui: /* References */

== Presented by ==
Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid

== Introduction ==

== Text Representation and Encoding ==

== Classification ==

== Clustering ==

== Clustering ==

== Information Extraction ==

== Biomedical Application ==

== Discussion ==

== References ==

Describtion of Text Mining

2020-11-15T17:33:57Z

D5cui: /* Discussion */

Describtion of Text Mining

2020-11-15T17:32:33Z

D5cui: /* Biomedical Application */

Describtion of Text Mining

2020-11-15T17:31:29Z

D5cui: /* Information Extraction */

Describtion of Text Mining

2020-11-15T17:30:20Z

D5cui: /* Clustering */

Describtion of Text Mining

2020-11-15T17:29:25Z

D5cui: /* Classification */

== Presented by ==
Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid

== Introduction ==

== Text Representation and Encoding ==

== Classification ==

Describtion of Text Mining

2020-11-15T17:28:23Z

D5cui: /* Text Representation and Encoding*/

== Presented by ==
Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid

== Introduction ==

== Text Representation and Encoding ==

Describtion of Text Mining

2020-11-15T17:26:29Z

D5cui: /* Introduction */

== Presented by ==
Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid

== Introduction ==

Describtion of Text Mining

2020-11-15T17:25:03Z

D5cui: /* Presented by */

== Presented by ==
Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid

stat441F21

2020-10-20T16:24:26Z

D5cui: /* Paper presentation */