http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Y93fang&feedformat=atomstatwiki - User contributions [US]2024-03-29T06:06:36ZUser contributionsMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction&diff=48900Surround Vehicle Motion Prediction2020-12-02T17:56:33Z<p>Y93fang: /* Motion planning based on surrounding vehicle motion prediction */</p>
<hr />
<div>DROCC: '''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections was described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.<br />
<br />
== Previous Work ==<br />
The autonomous vehicle trajectory approaches previously used motion models like Constant Velocity and Constant Acceleration. These models are linear and are only able to handle straight motions. There are curvilinear models such as Constant Turn Rate and Velocity and Constant Turn Rate and Acceleration which handle rotations and more complex motions. Together with these models, Kalman Filter is used to predicting the vehicle trajectory. However, the performance of the Kalman Filter in predicting multi-step problem is not that good. Recurrent Neural Network performs significantly better than it. <br />
<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. <br />
<br />
Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.<br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection. The following image gives a visual representation of the model.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Data ===<br />
Multi-lane turn intersections are the target roads in this paper. The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.<br />
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading, and speed.<br />
<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit. <br />
==== Sequence length ==== <br />
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.<br />
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles. <br />
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\<br />
&||\mu(k+1|t) - \mu(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
Where <math>u_{min}</math>, <math>u_{max}</math>and S are the minimum/maximum control input and maximum slew rate of input respectively.<br />
<br />
Determine the position and speed boundary based on the predicted state:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.<br />
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms, a path-following model with <br />
constant velocity, a path-following model with traffic flow and a CTRV model.<br />
<br />
We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>, <br />
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center><br />
<br />
The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean, <br />
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped <br />
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers' <br />
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within <br />
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, <br />
the proposed algorithm can be precise and maintain safety simultaneously.<br />
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method mimics a human driver better, by simulating a human driver's decision-making process. <br />
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target <br />
the vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to <br />
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based <br />
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when <br />
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means <br />
that these cases took place sufficiently beyond the safety distance, and had little influence on determining <br />
the behaviour of the subject vehicle.<br />
<br />
<center>[[Image:Figure11_YanYu.png|500px|]]</center><br />
<br />
In order to compare the similarities between the results form the proposed algorithm and human driving decisions, <br />
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math><br />
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm, <br />
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base <br />
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm <br />
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed <br />
model is efficient and safe.<br />
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm. The evaluation results showed precise prediction accuracy. In addition, the time-to-recognize in-lane targets within the intersection improved significantly over the performance of the base algorithms. The proposed algorithm was compared with human driving data, and it showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unconstructed environments, such as multi-lane turn intersections.<br />
<br />
== Future works ==<br />
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.<br />
<br />
3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.<br />
<br />
This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.<br />
<br />
Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.<br />
<br />
There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.<br />
<br />
It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
It would be really helpful if some visualization or data summary can be provided to understand the content, such as the track of the car movement.<br />
<br />
The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.<br />
<br />
Understandably, a supervised learning problem should be evaluated on some test dataset. However, supervised learning techniques are inherently ill-suited for general planning problems. The test dataset was obtained from human driving data which is known to be extremely noisy as well as unpredictable when it comes to motion planning. It would be crucial to determine the successes of this paper based on the state-of-the-art reinforcement learning techniques.<br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6]Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction&diff=48899Surround Vehicle Motion Prediction2020-12-02T17:56:02Z<p>Y93fang: /* Motion planning based on surrounding vehicle motion prediction */</p>
<hr />
<div>DROCC: '''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections was described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.<br />
<br />
== Previous Work ==<br />
The autonomous vehicle trajectory approaches previously used motion models like Constant Velocity and Constant Acceleration. These models are linear and are only able to handle straight motions. There are curvilinear models such as Constant Turn Rate and Velocity and Constant Turn Rate and Acceleration which handle rotations and more complex motions. Together with these models, Kalman Filter is used to predicting the vehicle trajectory. However, the performance of the Kalman Filter in predicting multi-step problem is not that good. Recurrent Neural Network performs significantly better than it. <br />
<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. <br />
<br />
Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.<br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection. The following image gives a visual representation of the model.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Data ===<br />
Multi-lane turn intersections are the target roads in this paper. The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.<br />
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading, and speed.<br />
<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit. <br />
==== Sequence length ==== <br />
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.<br />
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles. <br />
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\<br />
&||\mu(k+1|t) - \mu(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
Where <math>u_{min}</math>, <math>u_{max}</math>and S are the minimum/maximum control input and maximum slew rate of input respectively.<br />
Determine the position and speed boundary based on the predicted state:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.<br />
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms, a path-following model with <br />
constant velocity, a path-following model with traffic flow and a CTRV model.<br />
<br />
We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>, <br />
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center><br />
<br />
The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean, <br />
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped <br />
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers' <br />
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within <br />
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, <br />
the proposed algorithm can be precise and maintain safety simultaneously.<br />
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method mimics a human driver better, by simulating a human driver's decision-making process. <br />
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target <br />
the vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to <br />
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based <br />
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when <br />
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means <br />
that these cases took place sufficiently beyond the safety distance, and had little influence on determining <br />
the behaviour of the subject vehicle.<br />
<br />
<center>[[Image:Figure11_YanYu.png|500px|]]</center><br />
<br />
In order to compare the similarities between the results form the proposed algorithm and human driving decisions, <br />
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math><br />
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm, <br />
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base <br />
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm <br />
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed <br />
model is efficient and safe.<br />
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm. The evaluation results showed precise prediction accuracy. In addition, the time-to-recognize in-lane targets within the intersection improved significantly over the performance of the base algorithms. The proposed algorithm was compared with human driving data, and it showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unconstructed environments, such as multi-lane turn intersections.<br />
<br />
== Future works ==<br />
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.<br />
<br />
3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.<br />
<br />
This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.<br />
<br />
Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.<br />
<br />
There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.<br />
<br />
It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
It would be really helpful if some visualization or data summary can be provided to understand the content, such as the track of the car movement.<br />
<br />
The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.<br />
<br />
Understandably, a supervised learning problem should be evaluated on some test dataset. However, supervised learning techniques are inherently ill-suited for general planning problems. The test dataset was obtained from human driving data which is known to be extremely noisy as well as unpredictable when it comes to motion planning. It would be crucial to determine the successes of this paper based on the state-of-the-art reinforcement learning techniques.<br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6]Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction&diff=48898Surround Vehicle Motion Prediction2020-12-02T17:55:36Z<p>Y93fang: /* Motion planning based on surrounding vehicle motion prediction */</p>
<hr />
<div>DROCC: '''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections was described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.<br />
<br />
== Previous Work ==<br />
The autonomous vehicle trajectory approaches previously used motion models like Constant Velocity and Constant Acceleration. These models are linear and are only able to handle straight motions. There are curvilinear models such as Constant Turn Rate and Velocity and Constant Turn Rate and Acceleration which handle rotations and more complex motions. Together with these models, Kalman Filter is used to predicting the vehicle trajectory. However, the performance of the Kalman Filter in predicting multi-step problem is not that good. Recurrent Neural Network performs significantly better than it. <br />
<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. <br />
<br />
Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.<br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection. The following image gives a visual representation of the model.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Data ===<br />
Multi-lane turn intersections are the target roads in this paper. The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.<br />
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading, and speed.<br />
<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit. <br />
==== Sequence length ==== <br />
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.<br />
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles. <br />
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\<br />
&||\mu(k+1|t) - \mu(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
Where <math>u_{min}</math>, <math>u_{max}</math>and S are the minimum/maximum control input and maximum slew rate of input respectively</math><br />
Determine the position and speed boundary based on the predicted state:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.<br />
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms, a path-following model with <br />
constant velocity, a path-following model with traffic flow and a CTRV model.<br />
<br />
We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>, <br />
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center><br />
<br />
The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean, <br />
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped <br />
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers' <br />
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within <br />
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, <br />
the proposed algorithm can be precise and maintain safety simultaneously.<br />
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method mimics a human driver better, by simulating a human driver's decision-making process. <br />
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target <br />
the vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to <br />
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based <br />
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when <br />
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means <br />
that these cases took place sufficiently beyond the safety distance, and had little influence on determining <br />
the behaviour of the subject vehicle.<br />
<br />
<center>[[Image:Figure11_YanYu.png|500px|]]</center><br />
<br />
In order to compare the similarities between the results form the proposed algorithm and human driving decisions, <br />
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math><br />
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm, <br />
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base <br />
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm <br />
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed <br />
model is efficient and safe.<br />
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm. The evaluation results showed precise prediction accuracy. In addition, the time-to-recognize in-lane targets within the intersection improved significantly over the performance of the base algorithms. The proposed algorithm was compared with human driving data, and it showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unconstructed environments, such as multi-lane turn intersections.<br />
<br />
== Future works ==<br />
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.<br />
<br />
3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.<br />
<br />
This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.<br />
<br />
Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.<br />
<br />
There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.<br />
<br />
It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
It would be really helpful if some visualization or data summary can be provided to understand the content, such as the track of the car movement.<br />
<br />
The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.<br />
<br />
Understandably, a supervised learning problem should be evaluated on some test dataset. However, supervised learning techniques are inherently ill-suited for general planning problems. The test dataset was obtained from human driving data which is known to be extremely noisy as well as unpredictable when it comes to motion planning. It would be crucial to determine the successes of this paper based on the state-of-the-art reinforcement learning techniques.<br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6]Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Efficient_kNN_Classification_with_Different_Numbers_of_Nearest_Neighbors&diff=48891Efficient kNN Classification with Different Numbers of Nearest Neighbors2020-12-02T17:41:31Z<p>Y93fang: /* Reconstruction */</p>
<hr />
<div>== Presented by == <br />
Cooper Brooke, Daniel Fagan, Maya Perelman<br />
<br />
== Introduction == <br />
Traditional model-based classification approaches first use training observations to fit a model before predicting test samples. In contrast, the model-free k-nearest neighbors (KNNs) method classifies observations with a majority rules approach, labeling each piece of test data based on its k closest training observations (neighbours). This method has become very popular due to its strong performance and simple implementation. <br />
<br />
There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples, and the second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.<br />
<br />
== Previous Work == <br />
<br />
Previous work on finding an optimal fixed k value for all test samples is well-studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al.[2] explored involved selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] had the solution of selecting a robust k using the consensus of multiple rounds of kNNs. These fixed k methods are valuable however are impractical for data mining and machine learning applications. <br />
<br />
Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahugara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross validation while Sahugara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches such as those taken by Zheng et al. and Góra and Wojna also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.<br />
<br />
== Motivation == <br />
<br />
Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seeks to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.<br />
<br />
A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.<br />
<br />
== Approach == <br />
<br />
<br />
=== kTree Classification ===<br />
<br />
The proposed kTree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_1.png | center | 800x800px]]<br />
<br />
==== Reconstruction ====<br />
<br />
The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}\in \mathbb{R}^{d\times n} = [x_1,...,x_n]</math> represents the training set can be written as:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2<br />
\end{aligned}$$<br />
<br />
In addition, an <math>l_1</math> regularization term multiplied by a tuning parameter, <math>\rho_1</math>, is added to ensure that sparse results are generated as the objective is to minimize the number of training samples that will eventually be depended on by the test samples. <br />
<br />
The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. After some transformations, this second regularization term that has tuning parameter <math>\rho_2</math> is:<br />
<br />
$$\begin{aligned}<br />
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})<br />
\end{aligned}$$<br />
<br />
where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features.<br />
<br />
This gives a final objective function of:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}|| + \rho_2R(\textbf{W})<br />
\end{aligned}$$<br />
<br />
Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.<br />
<br />
==== Calculate ''k'' for training set ====<br />
<br />
Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training sample so if a value is 0, it can be concluded that the jth training sample has no effect on the ith training sample which means that it should not be used in the prediction of the ith training sample. Consequently, all non-zero values in the <math>w_{.j}</math> vector would be useful in predicting the ith training sample which gives the result that the number of these non-zero elements for each sample is equal to the optimal ''k'' value for each sample.<br />
<br />
For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:<br />
<br />
[[File:Approach_Figure_2.png | center | 300x300px]]<br />
<br />
The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 are non-zero.<br />
<br />
==== Train a Decision Tree using ''k'' as the label ====<br />
<br />
In a normal decision tree, the target data is the labels themselves. In contrast, in the kTree method, the target data is the optimal ''k'' value for each sample that was solved for in the previous step. So this decision tree has the following form:<br />
<br />
[[File:Approach_Figure_3.png | center | 300x300px]]<br />
<br />
==== Making Predictions for Test Data ====<br />
<br />
The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbours across '''all''' of the training data.<br />
<br />
=== k*Tree Classification ===<br />
<br />
The proposed k*Tree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_4.png | center | 1000x1000px]]<br />
<br />
Clearly, this is a very similar approach to the kTree as the k*Tree method attempts to sacrifice very little in predictive power in return for a substantial decrease in complexity when actually implementing the traditional kNN on the testing data once the optimal ''k'' values have been found.<br />
<br />
While all steps previous are the exact same, the k*Tree method not only stores the optimal ''k'' value but also the following information:<br />
<br />
* The training samples that have the same optimal ''k''<br />
* The ''k'' nearest neighbours of the previously identified training samples<br />
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours<br />
<br />
The data stored in each node is summarized in the following figure:<br />
<br />
[[File:Approach_Figure_5.png | center | 800x800px]]<br />
<br />
In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.<br />
<br />
== Experiments == <br />
<br />
In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, and contained a mix of data types varying in size, in dimensionality, in the number of classes, and in imbalanced nature of the data. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:<br />
<br />
# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.<br />
# kNN-Based Applicability Domain Approach (AD-kNN) [11]<br />
# kNN Method Based on Sparse Learning (S-kNN) [10]<br />
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]<br />
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]<br />
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]<br />
<br />
The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features. <br />
<br />
<br />
'''A. Experimental Results on Different Sample Sizes'''<br />
<br />
The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.<br />
<br />
[[File:Table_I_kNN.png | center | 1000x1000px]]<br />
<br />
The following key results are noted:<br />
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-KNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.<br />
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.<br />
* The proposed k*Tree method had the lowest running cost of all methods. However, the k*Tree method was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method.<br />
<br />
<br />
'''B. Experimental Results on Different Feature Numbers'''<br />
<br />
The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score, an algorithm that solves maximum likelihood equations numerically [15], was used to rank and select the most information features in the datasets. <br />
<br />
[[File:Table_II_kNN.png | center | 1000x1000px]]<br />
<br />
From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-KNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.<br />
<br />
== Conclusion == <br />
<br />
This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods can classify the test samples efficiently and effectively, by designing a training step that reduces the run time of the test stage and thus enhances the performance. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features.<br />
<br />
== Critiques == <br />
<br />
*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy. <br />
* The authors addressed that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.). <br />
*While the authors contrast their ktTee and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compared to Góra and Wojna varied optimal k method.<br />
<br />
* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.<br />
<br />
* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.<br />
<br />
* It would be really helpful if Ktrees method can be explained at the very beginning. The transition from KNN to Ktrees is not very smooth.<br />
<br />
* It would be nice to have a comparison of the running costs of different methods to see how much cost the kTree and k*Tree reduced.<br />
<br />
* It would be better to show the key result only on a summary rather than stacking up all results without screening.<br />
<br />
* In the results section, it was mentioned that in the experiment on data sets with different numbers of features, the kTree and k*Tree model did not achieve GS-kNN or S-kNN's accuracies, but was faster in terms of running cost. It might be helpful here if the authors add some more supporting arguments about the benefit of this tradeoff, which appears to be a minor decrease in accuracy for a large improvement in speed. This could further showcase the advantages of the kTree and k*Tree models. More quantitative analysis or real-life scenario examples could be some choices here.<br />
<br />
* An interesting thing to notice while solving for the optimal matrix <math>W^*</math> that minimizes the loss function is that <math>W^*</math> is not necessarily a symmetric matrix. That is, the correlation between the <math>i^{th}</math> entry and the <math>j^{th}</math> entry is different from that between the <math>j^{th}</math> entry and the <math>i^{th}</math> entry, which makes the resulting W* not really semantically meaningful. Therefore, it would be interesting if we may set a threshold on the allowing difference between the <math>ij^{th}</math> entry and the <math>ji^{th}</math> entry in <math>W^*</math> and see if this new configuration will give better or worse results compared to current ones, which will provide better insights of the algorithm.<br />
<br />
* It would be interesting to see how the proposed model work with highly non-linear datasets. In the event it does not work well, it would pose the question: would replacing the k*Tree with a SVM or a neural network improve the accuracy? There could be experiments to show if this variant would prove superior over the original models.<br />
<br />
* The key results are a little misleading - for example the claim "the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method" is false. The kTree method had slightly lower accuracy than both GS-kNN and S-kNN and kTree was also slower than LC-kNN<br />
<br />
== References == <br />
<br />
[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.<br />
<br />
[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.<br />
<br />
[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.<br />
<br />
[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.<br />
<br />
[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.<br />
<br />
[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.<br />
<br />
[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.<br />
<br />
[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.<br />
<br />
[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.<br />
<br />
[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.<br />
<br />
[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013. <br />
<br />
[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.<br />
<br />
[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.<br />
<br />
[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.<br />
<br />
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectors&diff=48850Improving neural networks by preventing co-adaption of feature detectors2020-12-02T06:02:47Z<p>Y93fang: /* Models for CIFAR-10 */</p>
<hr />
<div>== Presented by ==<br />
Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim<br />
<br />
= Introduction =<br />
In this paper, Hinton et al. introduce a novel way to improve neural networks’ performance, particularly in the case that a large feedforward neural network is trained on a small training set, which causes poor performance and Leeds to “overfitting” problem. This problem can be reduced by randomly omitting half of the feature detectors on each training case. In fact, By omitting neurons in hidden layers with a probability of 0.5, each hidden unit is prevented from relying on other hidden units being present during training. Hence there are fewer co-adaptations among them on the training data. Called “dropout,” this process is also an efficient alternative to training many separate networks and average their predictions on the test set.<br />
They used the standard, stochastic gradient descent algorithm and separated training data into mini-batches. An upper bound was set on the L2 norm of incoming weight vector for each hidden neuron, which was normalized if its size exceeds the bound. They found that using a constraint, instead of a penalty, forced model to do a more thorough search of the weight-space, when coupled with the very large learning rate that decays during training. <br />
Their dropout models included all of the hidden neurons, and their outgoing weights were halved to account for the chances of omission. The models were shown to result in lower test error rates on several datasets: MNIST; TIMIT; Reuters Corpus Volume; CIFAR-10; and ImageNet.<br />
<br />
= MNIST =<br />
The MNIST dataset contains 70,000 digit images of size 28 x 28. To see the impact of dropout, they used 4 different neural networks (784-800-800-10, 784-1200-1200-10, 784-2000-2000-10, 784-1200-1200-1200-10), using the same dropout rates as 50% for hidden neurons and 20% for visible neurons. Stochastic gradient descent was used with mini-batches of size 100 and a cross-entropy objective function as the loss function. Weights were updated after each minibatch, and training was done for 3000 epochs. An exponentially decaying learning rate <math>\epsilon</math> was used, with the initial value set as 10.0, and it was multiplied by the decaying factor <math>f</math> = 0.998 at the end of each epoch. At each hidden layer, the incoming weight vector for each hidden neuron was set an upper bound of its length, <math>l</math>, and they found from cross-validation that the results were the best when <math>l</math> = 15. Initial weights values were pooled from a normal distribution with mean 0 and standard deviation of 0.01. To update weights, an additional variable, ''p'', called momentum, was used to accelerate learning. The initial value of <math>p</math> was 0.5, and it increased linearly to the final value 0.99 during the first 500 epochs, remaining unchanged after. Also, when updating weights, the learning rate was multiplied by <math>1 – p</math>. <math>L</math> denotes the gradient of loss function.<br />
<br />
[[File:weights_mnist2.png|center|400px]]<br />
<br />
The best published result for a standard feedforward neural network was 160 errors. This was reduced to about 130 errors with 0.5 dropout and different L2 constraints for each hidden unit input weight. By omitting a random 20% of the input pixels in addition to the aforementioned changes, the number of errors was further reduced to 110. The following figure visualizes the result.<br />
[[File:mnist_figure.png|center|500px]]<br />
A publicly available pre-trained deep belief net resulted in 118 errors, and it was reduced to 92 errors when the model was fine-tuned with dropout. Another publicly available model was a deep Boltzmann machine, and it resulted in 103, 97, 94, 93 and 88 when the model was fine-tuned using standard backpropagation and was unrolled. They were reduced to 83, 79, 78, 78, and 77 when the model was fine-tuned with dropout – the mean of 79 errors was a record for models that do not use prior knowledge or enhanced training sets.<br />
<br />
= TIMIT = <br />
<br />
TIMIT dataset includes voice samples of 630 American English speakers varying across 8 different dialects. It is often used to evaluate the performance of automatic speech recognition systems. Using Kaldi, the dataset was pre-processed to extract input features in the form of log filter bank responses.<br />
<br />
=== Pre-training and Training ===<br />
<br />
For pretraining, they pretrained their neural network with a deep belief network and the first layer was built using Restricted Boltzmann Machine (RBM). Initializing visible biases with zero, weights were sampled from random numbers that followed normal distribution <math>N(0, 0.01)</math>. Each visible neuron’s variance was set to 1.0 and remained unchanged.<br />
<br />
Minimizing Contrastive Divergence (CD) was used to facilitate learning. Since momentum is used to speed up learning, it was initially set to 0.5 and increased linearly to 0.9 over 20 epochs. The average gradient had 0.001 of a learning rate which was then multiplied by <math>(1-momentum)</math> and L2 weight decay was set to 0.001. After setting up the hyperparameters, the model was done training after 100 epochs. Binary RBMs were used for training all subsequent layers with a learning rate of 0.01. Then, <math>p</math> was set as the mean activation of a neuron in the data set and the visible bias of each neuron was initialized to <math>log(p/(1 − p))</math>. Training each layer with 50 epochs, all remaining hyper-parameters were the same as those for the Gaussian RBM.<br />
<br />
=== Dropout tuning ===<br />
<br />
The initial weights were set in a neural network from the pretrained RBMs. To finetune the network with dropout-backpropagation, momentum was initially set to 0.5 and increased linearly up to 0.9 over 10 epochs. The model had a small constant learning rate of 1.0 and it was used to apply to the average gradient on a minibatch. The model also retained all other hyperparameters the same as the model from MNIST dropout finetuning. The model required approximately 200 epochs to converge. For comparison purpose, they also finetuned the same network with standard backpropagation with a learning rate of 0.1 with the same hyperparameters.<br />
<br />
=== Classification Test and Performance ===<br />
<br />
A Neural network was constructed to output the classification error rate on the test set of TIMIT dataset. They have built the neural network with four fully-connected hidden layers with 4000 neurons per layer. The output layer distinguishes distinct classes from 185 softmax output neurons that are merged into 39 classes. After constructing the neural network, 21 adjacent frames with an advance of 10ms per frame was given as an input.<br />
<br />
Comparing the performance of dropout with standard backpropagation on several network architectures and input representations, dropout consistently achieved lower error and cross-entropy. Results showed that it significantly controls overfitting, making the method robust to choices of network architecture. It also allowed much larger nets to be trained and removed the need for early stopping. Thus, neural network architectures with dropout are not very sensitive to the choice of learning rate and momentum.<br />
<br />
= Reuters Corpus Volume =<br />
Reuters Corpus Volume I archives 804,414 news documents that belong to 103 topics. Under four major themes - corporate/industrial, economics, government/social, and markets – they belonged to 63 classes. After removing 11 classes with no data and one class with insufficient data, they are left with 50 classes and 402,738 documents. The documents were divided into training and test sets equally and randomly, with each document representing the 2000 most frequent words in the dataset, excluding stopwords.<br />
<br />
They trained two neural networks, with size 2000-2000-1000-50, one using dropout and backpropagation, and the other using standard backpropagation. The training hyperparameters are the same as that in MNIST, but training was done for 500 epochs.<br />
<br />
In the following figure, we see the significant improvements by the model with dropout in the test set error. On the right side, we see that learning with dropout also proceeds smoother. <br />
<br />
[[File:reuters_figure.png|700px|center]]<br />
<br />
= CNN =<br />
<br />
Feed-forward neural networks consist of several layers of neurons where each neuron in a layer applies a linear filter to the input image data and is passed on to the neurons in the next layer. When calculating the neuron’s output, scalar bias a.k.a weights is applied to the filter with nonlinear activation function as parameters of the network that are learned by training data. [[File:cnnbigpicture.jpeg|thumb|upright=2|center|alt=text|Figure: Overview of Convolutional Neural Network]] There are several differences between Convolutional Neural networks and ordinary neural networks. The figure above gives a visual representation of a Convolutional Neural Network. First, CNN’s neurons are organized topographically into a bank and laid out on a 2D grid, so it reflects the organization of dimensions of the input data. Secondly, neurons in CNN apply filters which are local, and which are centered at the neuron’s location in the topographic organization. Meaning that useful metrics or clues to identify the object in an input image which can be found by examining local neighborhoods of the image. Next, all neurons in a bank apply the same filter at different locations in the input image. When looking at the image example, green is an input to one neuron bank, yellow is filter bank, and pink is the output of one neuron bank (convolved feature). A bank of neurons in a CNN applies a convolution operation, aka filters, to its input where a single layer in a CNN typically has multiple banks of neurons, each performing a convolution with a different filter. The resulting neuron banks become distinct input channels into the next layer. The whole process reduces the net’s representational capacity, but also reduces the capacity to overfit.<br />
[[File:bankofneurons.gif|thumb|upright=3|center|alt=text|Figure: Bank of neurons]]<br />
<br />
=== Pooling ===<br />
<br />
Pooling layer summarizes the activities of local patches of neurons in the convolutional layer by subsampling the output of a convolutional layer. Pooling is useful for extracting dominant features, to decrease the computational power required to process the data through dimensionality reduction. The procedure of pooling goes on like this; output from convolutional layers is divided into sections called pooling units and they are laid out topographically, connected to a local neighborhood of other pooling units from the same convolutional output. Then, each pooling unit is computed with some function which could be maximum and average. Maximum pooling returns the maximum value from the section of the image covered by the pooling unit while average pooling returns the average of all the values inside the pooling unit (see example). In result, there are fewer total pooling units than convolutional unit outputs from the previous layer, this is due to larger spacing between pixels on pooling layers. Using the max-pooling function reduces the effect of outliers and improves generalization.<br />
[[File:maxandavgpooling.jpeg|thumb|upright=2|center|alt=text|Figure: Max pooling and Average pooling]]<br />
<br />
=== Local Response Normalization === <br />
<br />
This network includes local response normalization layers which are implemented in lateral form and used on neurons with unbounded activations and permits the detection of high-frequency features with a big neuron response. This regularizer encourages competition among neurons belonging to different banks. Normalization is done by dividing the activity of a neuron in bank <math>i</math> at position <math>(x,y)</math> by the equation:<br />
[[File:local response norm.png|upright=2|center|]] where the sum runs over <math>N</math> ‘adjacent’ banks of neurons at the same position as in the topographic organization of neuron bank. The constants, <math>N</math>, <math>alpha</math> and <math>betas</math> are hyper-parameters whose values are determined using a validation set. This technique is replaced by better techniques such as the combination of dropout and regularization methods (<math>L1</math> and <math>L2</math>)<br />
<br />
=== Neuron nonlinearities ===<br />
<br />
All of the neurons for this model use the max-with-zero nonlinearity where output within a neuron is computed as <math> a^{i}_{x,y} = max(0, z^i_{x,y})</math> where <math> z^i_{x,y} </math> is the total input to the neuron. The reason they use nonlinearity is because it has several advantages over traditional saturating neuron models, such as significant reduction in training time required to reach a certain error rate. Another advantage is that nonlinearity reduces the need for contrast-normalization and data pre-processing since neurons do not saturate- meaning activities simply scale up little by little with usually large input values. For this model’s only pre-processing step, they subtract the mean activity from each pixel and the result is a centered data.<br />
<br />
=== Objective function ===<br />
<br />
The objective function of their network maximizes the multinomial logistic regression objective which is the same as minimizing the average cross-entropy across training cases between the true label and the model’s predicted label.<br />
<br />
=== Weight Initialization === <br />
<br />
It’s important to note that if a neuron always receives a negative value during training, it will not learn because its output is uniformly zero under the max-with-zero nonlinearity. Hence, the weights in their model were sampled from a zero-mean normal distribution with a high enough variance. High variance in weights will set a certain number of neurons with positive values for learning to happen, and in practice, it’s necessary to try out several candidates for variances until a working initialization is found. In their experiment, setting a positive constant, or 1, as biases of the neurons in the hidden layers was helpful in finding it.<br />
<br />
=== Training ===<br />
<br />
In this model, a batch size of 128 samples and momentum of 0.9, we train our model using stochastic gradient descent. The update rule for weight <math>w</math> is $$ v_{i+1} = 0.9v_i + \epsilon <\frac{dE}{dw_i}> i$$ $$w_{i+1} = w_i + v_{i+1} $$ where <math>i</math> is the iteration index, <math>v</math> is a momentum variable, <math>\epsilon</math> is the learning rate and <math>\frac{dE}{dw}</math> is the average over the <math>i</math>th batch of the derivative of the objective with respect to <math>w_i</math>. The whole training process on CIFAR-10 takes roughly 90 minutes and ImageNet takes 4 days with dropout and two days without.<br />
<br />
=== Learning ===<br />
To determine the learning rate for the network, it is a must to start with an equal learning rate for each layer which produces the largest reduction in the objective function with power of ten. Usually, it is in the order of <math>10^{-2}</math> or <math>10^{-3}</math>. In this case, they reduce the learning rate twice by a factor of ten before termination of training.<br />
<br />
= CIFAR-10 =<br />
<br />
=== CIFAR-10 Dataset ===<br />
<br />
Removing incorrect labels, The CIFAR-10 dataset is a subset of the Tiny Images dataset with 10 classes. It contains 5000 training images and 1000 testing images for each class. The dataset has 32 x 32 color images searched from the web and the images are labeled with the noun used to search the image.<br />
<br />
[[File:CIFAR-10.png|thumb|upright=2|center|alt=text|Figure 4: CIFAR-10 Sample Dataset]]<br />
<br />
=== Models for CIFAR-10 ===<br />
<br />
Two models, one with dropout and one without dropout, were built to test the performance of dropout on CIFAR-10. All models have CNN with three convolutional layers each with a pooling layer. All of the pooling payers use a stride=2 and summarize a 3*3 neighborhood. The max-pooling method is performed by the pooling layer which follows the first convolutional layer, and the average-pooling method is performed by remaining 2 pooling layers. The first and second pooling layers with <math>N = 9, α = 0.001</math>, and <math>β = 0.75</math> are followed by response normalization layers. A ten-unit softmax layer, which is used to output a probability distribution over class labels, is connected with the upper-most pooling layer. Using filter size of 5×5, all convolutional layers have 64 filter banks.<br />
<br />
Additional changes were made with the model with dropout. The model with dropout enables us to use more parameters because dropout forces a strong regularization on the network. Thus, a fourth weight layer is added to take the input from the previous pooling layer. This fourth weight layer is locally connected, but not convolutional, and contains 16 banks of filters of size 3 × 3 with 50% dropout. Lastly, the softmax layer takes its input from this fourth weight layer.<br />
<br />
Thus, with a neural network with 3 convolutional hidden layers with 3 max-pooling layers, the classification error achieved 16.6% to beat 18.5% from the best published error rate without using transformed data. The model with one additional locally-connected layer and dropout at the last hidden layer produced the error rate of 15.6%.<br />
<br />
= ImageNet =<br />
<br />
===ImageNet Dataset===<br />
<br />
ImageNet is a dataset of millions of high-resolution images, and they are labeled among 1000 different categories. The data were collected from the web and manually labeled using MTerk tool, which is a crowd-sourcing tool provided by Amazon.<br />
Because this dataset has millions of labeled images in thousands of categories, it is very difficult to have perfect accuracy on this dataset even for humans because the ImageNet images may contain multiple objects and there are a large number of object classes. ImageNet and CIFAR-10 are very similar, but the scale of ImageNet is about 20 times bigger (1,300,000 vs 60,000). The size of ImageNet is about 1.3 million training images, 50,000 validation images, and 150,000 testing images. They used resized images of 256 x 256 pixels for their experiments.<br />
<br />
'''An ambiguous example to classify:'''<br />
<br />
[[File:imagenet1.png|200px|center]]<br />
<br />
When this paper was written, the best score on this dataset was the error rate of 45.7% by High-dimensional signature compression for large-scale image classification (J. Sanchez, F. Perronnin, CVPR11 (2011)). The authors of this paper could achieve a comparable performance of 48.6% error rate using a single neural network with five convolutional hidden layers with a max-pooling layer in between, followed by two globally connected layers and a final 1000-way softmax layer. When applying 50% dropout to the 6th layer, the error rate was brought down to 42.4%.<br />
<br />
'''ImageNet Dataset:'''<br />
<br />
[[File:imagenet2.png|400px|center]]<br />
<br />
===Models for ImageNet===<br />
<br />
They mostly focused on the model with dropout because the one without dropout had a similar approach, but there was a serious issue with overfitting. They used a convolutional neural network trained by 224×224 patches randomly extracted from the 256 × 256 images. This could reduce the network’s capacity to overfit the training data and helped generalization as a form of data augmentation. The method of averaging the prediction of the net on ten 224 × 224 patches of the 256 × 256 input image was used for testing their model patched at the center, four corners, and their horizontal reflections. To maximize the performance on the validation set, this complicated network architecture was used and it was found that dropout was very effective. Also, it was demonstrated that using non-convolutional higher layers with the number of parameters worked well with dropout, but it had a negative impact to the performance without dropout.<br />
<br />
The network contains seven weight layers. The first five are convolutional, and the last two are globally-connected. Max-pooling layers follow the layer number 1,2, and 5. And then, the output of the last globally-connected layer was fed to a 1000-way softmax output layers. Using this architecture, the authors achieved the error rate of 48.6%. When applying 50% dropout to the 6th layer, the error rate was brought down to 42.4%.<br />
<br />
<br />
[[File:modelh2.png|700px|center]] <br />
<br />
[[File:layer2.png|600px|center]]<br />
<br />
Like the previous datasets, such as the MNIST, TIMIT, Reuters, and CIFAR-10, we also see a significant improvement for the ImageNet dataset. Including complicated architectures like this one, introducing dropout generalizes models better and gives lower test error rates.<br />
<br />
= Conclusion =<br />
<br />
The authors have shown a consistent improvement by the models trained with dropout in classifying objects in the following datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.<br />
<br />
= Critiques =<br />
It is a very brilliant idea to dropout half of the neurons to reduce co-adaptations. It is mentioned that for fully connected layers, dropout in all hidden layers works better than dropout in only one hidden layer. There is another paper Dropout: A Simple Way to Prevent Neural Networks from<br />
Overfitting[https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf] gives a more detailed explanation.<br />
<br />
It will be interesting to see how this paper could be used to prevent overfitting of LSTMs.<br />
<br />
Firstly, it is a very interested topic of classification by "dropout" CNN method(omitting neurons in hidden layers). If the author can briefly explain the advantages of this method in processing image data in theory, it will be easier for readers to understand. Also, how to deal with overfitting issue would be valuable.<br />
<br />
The authors mention that they tried various dropout probabilities and that the majority of them improved the model's generalization performance, but that more extreme probabilities tended to be worse which is why a dropout rate of 50% was used in the paper. The authors further develop this point to mention that the method can be improved by adapting individual dropout probabilities of each hidden or input unit using validation tests. This would be an interesting area to further develop and explore, as using a hardcoded 50% dropout for all layers might not be the optimal choice for all CNN applications. It would have been interesting to see the results of their investigations of differing dropout rates.<br />
<br />
== Reference ==<br />
[1] N. Srivastave, "Dropout: a simple way to prevent neural networks from overfitting", The Journal of Machine Learning Research, Jan 2014.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Evaluating_Machine_Accuracy_on_ImageNet&diff=48846Evaluating Machine Accuracy on ImageNet2020-12-02T05:44:55Z<p>Y93fang: /* Comparison of Human and Machine Accuracies on Image Net */</p>
<hr />
<div>== Presented by == <br />
Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du<br />
<br />
== Introduction == <br />
ImageNet is the most influential data set in machine learning with images and corresponding labels over 1000 classes. This paper intends to explore the causes for performance differences between human experts and machine learning models, more specifically, CNN, on ImageNet. <br />
<br />
Firstly, some images may fall into multiple classes. As a result, it is possible to underestimate the performance if we map each image to strictly one label, which is what is being done in the top-1 metric. Therefore, we adopt both top-1 and top-5 metrics where the performances of models, unlike human labelers, are linearly correlated in both cases.<br />
<br />
Secondly, in contrast to the uniform performance of models in classes, humans tend to achieve better performances on inanimate objects. Human labelers achieve similar overall accuracies as the models, which indicates spaces of improvements on specific classes for machines.<br />
<br />
Lastly, the setup of drawing training and test sets from the same distribution may favour models over human labelers. That is, the accuracy of multi-class prediction from models drops when the testing set is drawn from a different distribution than the training set, ImageNetV2. But this shift in distribution does not cause a problem for human labelers.<br />
<br />
== Experiment Setup ==<br />
=== Overview ===<br />
There are four main phases to the experiment, which are (i) initial multilabel annotation, (ii) human labeler training, (iii) human labeler evaluation, and (iv) final annotation overview. The five authors of the paper are the participants in the experiments. <br />
<br />
A brief overview of the four phases is as follows:<br />
[[File:Experiment Set Up.png |800px| center]]<br />
<br />
=== Initial multi-label annotation ===<br />
Three labelers A, B, and C provided multi-label annotations for a subset from the ImageNet validation set, and all images from the ImageNetV2 test sets. These experiences give A, B, and C extensive experience with the ImageNet dataset. <br />
<br />
=== Human Labeler Training === <br />
All five labelers trained on labeling a subset of the remaining ImageNet images.<br />
<br />
=== Human Labeler Evaluation ===<br />
Class-balanced random samples are generated from both the ImageNet validation set and ImageNetV2. Five participants labeled these images over 28 days.<br />
<br />
=== Final annotation Review ===<br />
All labelers reviewed the additional annotations generated in the human labeler evaluation phase.<br />
<br />
== Multi-label annotations==<br />
[[File:Categories Multilabel.png|800px|center]]<br />
<div align="center">Figure 3</div><br />
<br />
===Top-1 accuracy===<br />
With Top-1 accuracy being the standard accuracy measure used in classification studies, it measures the proportions of examples for which the predicted label matches the single target label. As many images often contain more than one object for classification, for example, Figure 3a contains a desk, laptop, keyboard, space bar, and more. With Figure 3b showing a centered prominent figure yet labeled otherwise (people vs picket fence), it can be seen how a single target label is inaccurate for such a task since identifying the main objects in the image does not suffice due to its overly stringent and punishes predictions that are the main image yet does not match its label.<br />
===Top-5 accuracy===<br />
With Top-5 considers a classification correct if the object label is in the top 5 predicted labels, it partially resolves the problem with Top-1 labeling yet it is still not ideal since it can trivialize class distinctions. For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations.<br />
===Multi-label accuracy===<br />
The paper then proposes that for every image, the image shall have a set of target labels and a prediction; if such prediction matches one of the labels, it will be considered as correct labeling. Due to the above-discussed limitations of Top-1 and Top-5 metrics, the paper claims it is necessary for rigorous accuracy evaluation on the dataset. <br />
<br />
===Types of Multi-label annotations===<br />
====Multiple objects or organisms====<br />
For the images containing more than one object or organism that corresponds to ImageNet, the paper proposed to add an additional target label for each entity in the image. With the discussed image in Figure 3b, the class groom, bow tie, suit, gown, and hoopskirt are all present in the foreground which is then subsequently added to the set of labels.<br />
====Synonym or subset relations====<br />
For similar classes, the paper considers them as under the same bigger class, that is, for two similarly labeled images, classification is considered correct if the produced label matches either one of the labels. For instance, warthog, African elephant, and Indian element all have prominent tusks, they will be considered subclasses of the tusker, Figure 3c shows a modification of labels to contain tusker as a correct label.<br />
====Unclear Image====<br />
In certain cases such as Figure 3d, there is a distinctive difficulty to determine whether a label was correct due to ambiguities in the class hierarchy.<br />
===Collecting multi-label annotations===<br />
Participants reviewed all predictions made by the models on the dataset ImageNet and ImageNet-V2, the participants then categorized every unique prediction made by the models on the dataset into correct and incorrect labels in order to allow all images to have multiple correct labels to satisfy the above-listed method.<br />
===The multi-label accuracy metric===<br />
One prediction is only correct if and only if it was marked correct by the expert reviewers during the annotation stage. As discussed in the experiment setup section, after human labelers have completed labeling, a second annotation stage is conducted. In Figure 4, a comparison of Top-1, Top-5, and multi-label accuracies showed higher Top-1 and Top-5 accuracy corresponds with higher multi-label accuracy as expected. With multi-label accuracies measures consistently higher than Top-1 yet lower than Top-5 which shows a high correlation between the three metrics, the paper concludes that multi-label metrics measures a semantically more meaningful notion of accuracy compared to its counterparts.<br />
<br />
== Human Accuracy Measurement Process ==<br />
=== Bias Control ===<br />
Since three participants participated in the initial round of annotation, they did not look at the data for six months, and two additional annotators are introduced in the final evaluation phase to ensure fairness of the experiment. <br />
<br />
=== Human Labeler Training ===<br />
The three main difficulties encountered during human labeler training are fine-grained distinctions, class unawareness, and insufficient training images. Thus, three training regimens are provided to address the problems listed above, respectively. First, labelers will be assigned extra training tasks with immediate feedbacks on similar classes. Second, labelers will be provided access to search for specific classes during labeling. Finally, the training set will contain a reasonable amount of images for each class.<br />
<br />
=== Labeling Guide ===<br />
A labeling guide is constructed to distill class analysis learned during training into discriminative traits that could be used as a reference during the final labeling evaluation.<br />
<br />
=== Final Evaluation and Review ===<br />
Two samples, each containing 1000 images, are sampled from ImageNet and ImageNetV2, respectively, They are sampled in a class-balanced manner and shuffled together. Over 28 days, all five participants labeled all images. They spent a median of 26 seconds per image. After labeling is completed, an additional multi-label annotation session was conducted, in which human predictions for all images are manually reviewed. Comparing to the initial round of labeling, 37% of the labels changes due to participants' greater familiarity with the classes.<br />
<br />
== Main Results ==<br />
[[File:Evaluating Machine Accuracy on ImageNet Figure 1.png | center]]<br />
<br />
<div align="center">Figure 1</div><br />
<br />
===Comparison of Human and Machine Accuracies on Image Net===<br />
From Figure 1, we can see that the difference in accuracies between the datasets is within 1% for all human participants. As hypothesized, human testers indeed performed better than the automated models on both datasets. It's worth noticing that labelers D and E, who did not participate in the initial annotation period, actually performed better than the best automated model.<br />
===Comparison of Human and Machine Accuracies on Image Net===<br />
Based on the results shown in Figure 1, we can see that the confidence interval of the best 4 human participants and 4 best model overlap; however, with a p-value of 0.037 using the McNemar's paired test, it rejects the hypothesis that the FixResNeXt model and Human E labeler have the same accuracy with respect to the ImageNet validation dataset. Figure 1 also shows that the confidence intervals of the labeling accuracies for human labelers C, D, E do not overlap with the confidence interval of the best model with respect to ImageNet-V2 and with the McNemar's test yielding a p-value of <math>2\times 10^{-4}</math>, it is clear that the hypothesis human and machined models have same robustness to model distribution shifts ought to be rejected.<br />
<br />
== Other Observations ==<br />
<br />
[[File: Results_Summary_Table.png| 800px|center]]<br />
<br />
=== Difficult Images ===<br />
<br />
The experiment also shed some light on images that are difficult to label. 10 images were misclassified by all of the human labelers. Among those 10 images, there was 1 image of a monkey and 9 of dogs. In addition, 27 images, with 19 in object classes and 8 in organism classes, were misclassified by all 72 machine learning models in this experiment. Only 2 images were labeled wrong by all human labelers and models. Both images contained dogs. Researchers also noted that difficult images for models are mostly images of objects and exclusively images of animals for human labelers.<br />
<br />
=== Accuracies without dogs ===<br />
<br />
As previously discussed in the paper, machine learning models tend to outperform human labelers when classifying the 118 dog classes. To better understand to what extent does models outperform human labelers, researchers computed the accuracies again by excluding all the dog classes. Results showed a 0.6% increase in accuracy on the ImageNet images using the best model and a 1.1% increase on the ImageNet V2 images. In comparison, the mean increases in accuracy for human labelers are 1.9% and 1.8% on the ImageNet and ImageNet V2 images respectively. Researchers also conducted a simulation to demonstrate that the increase in human labeling accuracy on non-dog images is significant. This simulation was done by bootstrapping to estimate the changes in accuracy when only using data for the non-dog classes, and simulation results show smaller increases than in the experiment. <br />
<br />
In conclusion, it's more difficult for human labelers to classify images with dogs than it is for machine learning models.<br />
<br />
=== Accuracies on objects ===<br />
Researchers also computed machine and human labelers' accuracies on a subset of data with only objects, as opposed to organisms, to better illustrate the differences in performance. This test involved 590 object classes. As shown in the table above, there is a 3.3% and 3.4% increase in mean accuracies for human labelers on the ImageNet and ImageNet V2 images. In contrast, there is a 0.5% decrease in accuracy for the best model on both ImageNet and ImageNet V2. This indicates that human labelers are much better at classifying objects than these models are.<br />
<br />
=== Accuracies on fast images ===<br />
Unlike the CNN models, human labelers spent different amounts of time on different images, spanning from several seconds to 40 minutes. To further analyze the images that take human labelers less time to classify, researchers took a subset of images with median labeling time spent by human labelers of at most 60 seconds. These images were referred to as "fast images". There are 756 and 714 fast images from ImageNet and ImageNet V2 respectively, out of the total 2000 images used for evaluation. Accuracies of models and humans on the fast images increased significantly, especially for humans. <br />
<br />
This result suggests that human labelers know when an image is difficult to label and would spend more time on it. It also shows that the models are more likely to correctly label images that human labelers can label relatively quickly.<br />
<br />
== Related Work ==<br />
<br />
=== Human accuracy on ImageNet ===<br />
<br />
Russakovsky et al. (2015) studied two trained human labelers' accuracies on 1500 and 258 images in the context of the ImageNet challenge. The top-5 accuracy of the labeler who labeled 1500 images was the well-known human baseline on ImageNet. <br />
<br />
As introduced before, the researchers went beyond by using multi-label accuracy, using more labelers, and focusing on robustness to small distribution shifts. Although the researchers had some different findings, some results are also consistent with results from (Russakovsky et al., 2015). An example is that both experiments indicated that it takes human labelers around one minute to label an image. The time distribution also has a long tail, due to the difficult images as mentioned before.<br />
<br />
=== Human performance in computer vision broadly ===<br />
There are many examples of recent studies about humans in the area of computer vision, such as investigating human robustness to synthetic distribution change (Geirhos et al., 2017) and studying what characteristics do humans use to recognize objects (Geirhos et al., 2018). Other examples include the adversarial examples constructed to fool both machines and time-limited humans (Elsayed et al., 2018) and illustrating foreground/background objects' effects on human and machine performance (Zhu et al., 2016). <br />
<br />
=== Multi-label annotations ===<br />
Stock & Cissé (2017) also studied ImageNet's multi-label nature, which aligns with the researchers' study in this paper. According to Stock & Cissé (2017), the top-1 accuracy measure could underestimate multi-label by up to 13.2%.<br />
<br />
=== ImageNet inconsistencies and label error ===<br />
Researches have found and recorded some incorrectly labeled images from ImageNet and ImageNet V2 during this study. Earlier studies (Van Horn et al., 2015) also shown that at least 4% of the birds in ImageNet are misclassified. This work also noted that the inconsistent taxonomic structure in birds' classes could lead to weak class boundaries. Researchers also noted that the majority of the fine-grained organism classes also had similar taxonomic issues.<br />
<br />
=== Distribution shift ===<br />
There has been an increasing amount of studies in this area. One focus of the studies is distributionally robust optimization (DRO), which finds the model that has the smallest worst-case expected error over a set of probability distributions. Another focus is on finding the model with the lowest error rates on adversarial examples. Work in both areas has been productive, but none was shown to resolve the drop in accuracies between ImageNet and ImageNet V2. A recent [https://papers.nips.cc/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf paper] also discusses quantifying uncertainty under a distribution shift, in other words whether the output of probabilistic deep learning models should or should not be trusted.<br />
<br />
== Conclusion and Future Work ==<br />
<br />
=== Conclusion ===<br />
Researchers noted that in order to achieve truly reliable machine learning, people need a deeper understanding of what changes to input models should be robust to. This study has provided valuable insights into the desired robustness properties by comparing model performance to human performance. This is especially evident given the results of the experiment which show humans drastically outperforming machine learning in many cases and proposes the question of how much accuracy one is willing to give up in exchange for efficiency. The results have shown that current performance benchmarks are not addressing the robustness to small and natural distribution shifts, which are easily handled by humans.<br />
<br />
=== Future work ===<br />
Other than improving the robustness of models, researchers should consider investigating if less-trained human labelers can achieve a similar level of robustness to distributional shifts. In addition, researchers can study the robustness to temporal changes, which is another form of natural distribution shift (Gu et al., 2019; Shankar et al., 2019). Also, Convolutional Neural Network can be a candidate to improve the accuracy of classifying images.<br />
<br />
== Critiques ==<br />
<br />
# Table 1 simply showed a difference in ImageNet multi-label accuracy yet does not give an explicit reason as to why such a difference is present. Although the paper suggested the distribution shift has caused the difference, it does not give other factors to concretely explain why the distribution shift was the cause.<br />
# With the recommendation to future machine evaluations, the paper proposed to "Report performances on dogs, other animals, and inanimate objects separately.". Despite its intentions, it is narrowly specific and requires further generalization for it to be convincing. <br />
# With choosing human subjects as samplers, no further information was given as to how they are chosen nor there are any background information was given. As it is a classification problem involving many classes as specific to species, a biology student would give far more accurate results than a computer science student or a math student. <br />
# As explaining the importance of multi-label metrics using comparison to Top-5 metric, the turtle example falls within the overall similarity (simony) classification of the multi-label evaluation metric, as such, if the Top-5 evaluation suggests any one of the turtle species were selected, the algorithm is considered to produce a correct prediction which is the intention. The example does not convey the necessity of changing to the proposed metric over the Top-5 metric. <br />
# With the definition in the paper regarding multi-label metrics, it is hard to see why expanding the label set is different from a traditional Top-5 metric or rather necessary, ergo does not yield the claim which the proposed metric is necessary for rigorous accuracy evaluation on ImageNet.<br />
# When discussing the main results, the paper discusses the hypothesis on distribution shift having no effects on human and machine model accuracies; the presentation is poor at best with no clear centric to what they are trying to convey to how (in detail) they resulted in such claims.<br />
# In the experiment setup of the presentation, there are a lot of key terms without detailed description. For example, Human labeler training using a subset of the remaining 30,000 unannotated images in the ImageNet validation set, labelers A, B, C, D, and E underwent extensive training to understand the intricacies of fine-grained class distinctions in the ImageNet class hierarchy. Authors should clarify each key term in the presentation otherwise readers are hard to follow.<br />
# Not sure how the human samplers were determined and simply picking several people will have really high bias because the sample is too small and they have different background which will definitely affect the results a lot. Also, it will be better if there are more comparisons between the model introduced and other models.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Speech2Face:_Learning_the_Face_Behind_a_Voice&diff=48841Speech2Face: Learning the Face Behind a Voice2020-12-02T05:38:56Z<p>Y93fang: /* Model Architecture */</p>
<hr />
<div>== Presented by == <br />
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang<br />
<br />
== Introduction ==<br />
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.<br />
<br />
== Previous Work ==<br />
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, separate speech from multiple concurrent sources, predict lip motion from speech, and even learn the emotion of the agents based on their voices.Aytar et al. [6] proposed a student-teacher training procedure in which a well established visual recognition model was used to transfer the knowledge obtained in the visual modality to the sound modality, using unlabeled videos.<br />
<br />
Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.<br />
<br />
== Motivation ==<br />
It seems to be a common trait among humans to imagine what some people look like when we hear their voices before we have seen what they look lke. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.<br />
<br />
== Model Architecture == <br />
<br />
'''Speech2Face model and training pipeline'''<br />
<br />
[[File:ModelFramework.jpg|center]]<br />
<br />
<div style="text-align:center;"> Figure 1. '''Speech2Face model and training pipeline''' </div><br />
<br />
<br />
<br />
The Speech2Face Model used to achieve the desired result consists of two parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database [5] is used to extract a 4069-D face feature from the penultimate layer of the network. <br />
<br />
'''Voice Encoder Architecture''' <br />
<br />
[[File:VoiceEncoderArch.JPG|center]]<br />
<br />
<div style="text-align:center;"> Table 1: '''Voice encoder architecture''' </div><br />
<br />
<br />
<br />
The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time and allows the model to be used for input speeches of varying lengths. Two fully connected layers at the end are used to return a 4096-dimensional facial feature output.<br />
<br />
'''Training'''<br />
<br />
The AVSSpeech dataset, a large-scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair color, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.<br />
<br />
The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and then inputted to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096-dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.<br />
<br />
In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math>: <math> \mathbb{R}^{4096} \to \mathbb{R}^{2622}</math> and the first layer <math>f_{dec}</math> : <math> \mathbb{R}^{4096} \to \mathbb{R}^{1000}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$<br />
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}(a)\text{log}p_{(i)}(b)$$ $$p_{(i)}(a) = \frac{\text{exp}(a_i/T)}{\sum_j \text{exp}(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.<br />
<br />
<center><br />
[[File:L1vsTotalLoss.png | 700px]]<br />
</center><br />
<br />
<div style="text-align:center;"> Figure 2: '''Qualitative results on the AVSpeech test set''' </div><br />
<br />
== Results ==<br />
<br />
'''Confusion Matrix and Dataset statistics'''<br />
<br />
<center><br />
[[File:Confusionmatrix.png| 600px]]<br />
</center><br />
<br />
<div style="text-align:center;"> Figure 3. '''Facial attribute evaluation''' </div><br />
<br />
<br />
<br />
In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face. <br />
<br />
'''Feature Similarity'''<br />
<br />
<center><br />
[[File:FeatSim.JPG]]<br />
</center><br />
<br />
<div style="text-align:center;"> Table 2. '''Feature similarity''' </div><br />
<br />
<br />
<br />
Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and the true facial feature vector from the face decoder were computed, and presented, above, in Table 2. A comparison of facial similarity was also done based on the length of audio input. From the table, it is evident that the 6-second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth. <br />
<br />
'''S2F -> Face retrieval performance'''<br />
<br />
<center><br />
[[File: Retrieval.JPG]]<br />
</center><br />
<br />
<div style="text-align:center;"> Table 3. '''S2F -> Face retrieval performance''' </div><br />
<br />
<br />
<br />
The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.<br />
<br />
'''Additional Observations''' <br />
<br />
Ablation studies were carried out to test the effect of audio duration and batch normalization. It was found that the duration of input audio during the training stage had little effect on convergence speed (comparing 3 and 6-second speech segments), while in the test stage longer input speech yields improvement in reconstruction quality. With respect to batch normalization (BN), it was found that without BN reconstructed faces would converge to an average face, while the inclusion of BN led to results which contained much richer facial features.<br />
<br />
== Conclusion ==<br />
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.<br />
<br />
== Discussion and Critiques ==<br />
<br />
There is evidence that the results of the model may be heavily influenced by external factors:<br />
<br />
1. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the model's prediction of ethnicity towards white. The bias in the results shows that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. <br />
<br />
2. The model was shown to infer different face features based on language. This puts into question how heavily the model depends on the spoken language. The paper mentioned the quality of face reconstruction may be affected by uncommon languages, where English is the most popular language on Youtube(training set). Testing a more controlled sample where all speech recording was of the same language may help address this concern to determine the model's reliance on spoken language.<br />
<br />
3. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender, and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face and may result in a compounding effect on the miss-classification rate. <br />
<br />
4. Figure 4.b shows the AVSpeech dataset statistics. However, it doesn't show the statistics about speakers' ethnicity and the language of the video. If we train the model with a more comprehensive dataset that includes enough Asian/Indian English speakers and native language speakers will this increase the accuracy?<br />
<br />
5. One concern about the source of the training data, i.e. the Youtube videos, is that resolution varies a lot since the videos are randomly selected. That may be the reason why the proposed model performs badly on some certain features. For example, it is hard to tell the age when the resolution is bad because the wrinkles on the face are neglected.<br />
<br />
6. The topic of this project is very interesting, but I highly doubt this model will be practical in real world problems. Because there are many factors to affect a person's sound in the real world environment. Sounds such like phone clock, TV, car horn and so on. These sounds will decrease the accuracy of the predicted result of the model.<br />
<br />
7. A lot of information can be obtained from someone's voice, this can potentially be useful for detective work and crime scene investigation. In our world of increasing surveillance, public voice recording is quite common and we can reconstruct images of potential suspects based on their voice. In order for this to be achieved, the model has to be thoroughly trained and tested to avoid false positives as it could have highly destructive outcome for a falsely convicted suspect.<br />
<br />
8. This is a very interesting topic, and this summary has a good structure for readers. Since this model uses Youtube to train model, but I think one problem is that most of the YouTubers are adult, and there are many additional reasons that make this dataset be highly unbalanced. What is more, some people may have a baby voice, this also could affect the performance of the model. But overall, this is a meaningful topic, it might help police to locate the suspects. So it might be interesting to apply this to the police.<br />
<br />
9. In addition, it seems very unlikely that any results coming from this model would ever be held in a regard even remotely close to being admissible in court to identify a person of interest until the results are improved and the model can be shown to work in real world applications. Otherwise, there seems to be very little use for such technology and it could have negative impacts on people if they were to be depicted in an unflattering way by the model based on their voice.<br />
<br />
10. Using voice as a factor of constructing the face is a good idea, but it seems like the data they have will have lots of noises and bias. The voice of a video might not come from the person in the video. There are so many Youtubers adjusting their voices before uploading their video and it's really hard to know whether they adjust their voice. Also, most Youtubers are adults so the model cannot have enough training samples about teenagers and kids.<br />
<br />
== References ==<br />
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In<br />
IEEE International Conference on Computer Vision (ICCV),<br />
2017.<br />
<br />
[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International<br />
Conference on Acoustics, Speech and Signal Processing<br />
(ICASSP), 2019.<br />
<br />
[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.<br />
<br />
[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.<br />
<br />
[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Speech2Face:_Learning_the_Face_Behind_a_Voice&diff=48839Speech2Face: Learning the Face Behind a Voice2020-12-02T05:37:41Z<p>Y93fang: /* Model Architecture */</p>
<hr />
<div>== Presented by == <br />
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang<br />
<br />
== Introduction ==<br />
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.<br />
<br />
== Previous Work ==<br />
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, separate speech from multiple concurrent sources, predict lip motion from speech, and even learn the emotion of the agents based on their voices.Aytar et al. [6] proposed a student-teacher training procedure in which a well established visual recognition model was used to transfer the knowledge obtained in the visual modality to the sound modality, using unlabeled videos.<br />
<br />
Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.<br />
<br />
== Motivation ==<br />
It seems to be a common trait among humans to imagine what some people look like when we hear their voices before we have seen what they look lke. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.<br />
<br />
== Model Architecture == <br />
<br />
'''Speech2Face model and training pipeline'''<br />
<br />
[[File:ModelFramework.jpg|center]]<br />
<br />
<div style="text-align:center;"> Figure 1. '''Speech2Face model and training pipeline''' </div><br />
<br />
<br />
<br />
The Speech2Face Model used to achieve the desired result consists of two parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database [5] is used to extract a 4069-D face feature from the penultimate layer of the network. <br />
<br />
'''Voice Encoder Architecture''' <br />
<br />
[[File:VoiceEncoderArch.JPG|center]]<br />
<br />
<div style="text-align:center;"> Table 1: '''Voice encoder architecture''' </div><br />
<br />
<br />
<br />
The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time and allows the model to be used for input speeches of varying lengths. Two fully connected layers at the end are used to return a 4096-dimensional facial feature output.<br />
<br />
'''Training'''<br />
<br />
The AVSSpeech dataset, a large-scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair color, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.<br />
<br />
The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and then inputted to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096-dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.<br />
<br />
In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math> : <math> \mathbb{R}^{4096} \to \mathbb{R}^{1000}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$<br />
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}(a)\text{log}p_{(i)}(b)$$ $$p_{(i)}(a) = \frac{\text{exp}(a_i/T)}{\sum_j \text{exp}(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.<br />
<br />
<center><br />
[[File:L1vsTotalLoss.png | 700px]]<br />
</center><br />
<br />
<div style="text-align:center;"> Figure 2: '''Qualitative results on the AVSpeech test set''' </div><br />
<br />
== Results ==<br />
<br />
'''Confusion Matrix and Dataset statistics'''<br />
<br />
<center><br />
[[File:Confusionmatrix.png| 600px]]<br />
</center><br />
<br />
<div style="text-align:center;"> Figure 3. '''Facial attribute evaluation''' </div><br />
<br />
<br />
<br />
In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face. <br />
<br />
'''Feature Similarity'''<br />
<br />
<center><br />
[[File:FeatSim.JPG]]<br />
</center><br />
<br />
<div style="text-align:center;"> Table 2. '''Feature similarity''' </div><br />
<br />
<br />
<br />
Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and the true facial feature vector from the face decoder were computed, and presented, above, in Table 2. A comparison of facial similarity was also done based on the length of audio input. From the table, it is evident that the 6-second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth. <br />
<br />
'''S2F -> Face retrieval performance'''<br />
<br />
<center><br />
[[File: Retrieval.JPG]]<br />
</center><br />
<br />
<div style="text-align:center;"> Table 3. '''S2F -> Face retrieval performance''' </div><br />
<br />
<br />
<br />
The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.<br />
<br />
'''Additional Observations''' <br />
<br />
Ablation studies were carried out to test the effect of audio duration and batch normalization. It was found that the duration of input audio during the training stage had little effect on convergence speed (comparing 3 and 6-second speech segments), while in the test stage longer input speech yields improvement in reconstruction quality. With respect to batch normalization (BN), it was found that without BN reconstructed faces would converge to an average face, while the inclusion of BN led to results which contained much richer facial features.<br />
<br />
== Conclusion ==<br />
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.<br />
<br />
== Discussion and Critiques ==<br />
<br />
There is evidence that the results of the model may be heavily influenced by external factors:<br />
<br />
1. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the model's prediction of ethnicity towards white. The bias in the results shows that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. <br />
<br />
2. The model was shown to infer different face features based on language. This puts into question how heavily the model depends on the spoken language. The paper mentioned the quality of face reconstruction may be affected by uncommon languages, where English is the most popular language on Youtube(training set). Testing a more controlled sample where all speech recording was of the same language may help address this concern to determine the model's reliance on spoken language.<br />
<br />
3. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender, and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face and may result in a compounding effect on the miss-classification rate. <br />
<br />
4. Figure 4.b shows the AVSpeech dataset statistics. However, it doesn't show the statistics about speakers' ethnicity and the language of the video. If we train the model with a more comprehensive dataset that includes enough Asian/Indian English speakers and native language speakers will this increase the accuracy?<br />
<br />
5. One concern about the source of the training data, i.e. the Youtube videos, is that resolution varies a lot since the videos are randomly selected. That may be the reason why the proposed model performs badly on some certain features. For example, it is hard to tell the age when the resolution is bad because the wrinkles on the face are neglected.<br />
<br />
6. The topic of this project is very interesting, but I highly doubt this model will be practical in real world problems. Because there are many factors to affect a person's sound in the real world environment. Sounds such like phone clock, TV, car horn and so on. These sounds will decrease the accuracy of the predicted result of the model.<br />
<br />
7. A lot of information can be obtained from someone's voice, this can potentially be useful for detective work and crime scene investigation. In our world of increasing surveillance, public voice recording is quite common and we can reconstruct images of potential suspects based on their voice. In order for this to be achieved, the model has to be thoroughly trained and tested to avoid false positives as it could have highly destructive outcome for a falsely convicted suspect.<br />
<br />
8. This is a very interesting topic, and this summary has a good structure for readers. Since this model uses Youtube to train model, but I think one problem is that most of the YouTubers are adult, and there are many additional reasons that make this dataset be highly unbalanced. What is more, some people may have a baby voice, this also could affect the performance of the model. But overall, this is a meaningful topic, it might help police to locate the suspects. So it might be interesting to apply this to the police.<br />
<br />
9. In addition, it seems very unlikely that any results coming from this model would ever be held in a regard even remotely close to being admissible in court to identify a person of interest until the results are improved and the model can be shown to work in real world applications. Otherwise, there seems to be very little use for such technology and it could have negative impacts on people if they were to be depicted in an unflattering way by the model based on their voice.<br />
<br />
10. Using voice as a factor of constructing the face is a good idea, but it seems like the data they have will have lots of noises and bias. The voice of a video might not come from the person in the video. There are so many Youtubers adjusting their voices before uploading their video and it's really hard to know whether they adjust their voice. Also, most Youtubers are adults so the model cannot have enough training samples about teenagers and kids.<br />
<br />
== References ==<br />
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In<br />
IEEE International Conference on Computer Vision (ICCV),<br />
2017.<br />
<br />
[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International<br />
Conference on Acoustics, Speech and Signal Processing<br />
(ICASSP), 2019.<br />
<br />
[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.<br />
<br />
[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.<br />
<br />
[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Speech2Face:_Learning_the_Face_Behind_a_Voice&diff=48838Speech2Face: Learning the Face Behind a Voice2020-12-02T05:36:33Z<p>Y93fang: /* Model Architecture */</p>
<hr />
<div>== Presented by == <br />
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang<br />
<br />
== Introduction ==<br />
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.<br />
<br />
== Previous Work ==<br />
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, separate speech from multiple concurrent sources, predict lip motion from speech, and even learn the emotion of the agents based on their voices.Aytar et al. [6] proposed a student-teacher training procedure in which a well established visual recognition model was used to transfer the knowledge obtained in the visual modality to the sound modality, using unlabeled videos.<br />
<br />
Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.<br />
<br />
== Motivation ==<br />
It seems to be a common trait among humans to imagine what some people look like when we hear their voices before we have seen what they look lke. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.<br />
<br />
== Model Architecture == <br />
<br />
'''Speech2Face model and training pipeline'''<br />
<br />
[[File:ModelFramework.jpg|center]]<br />
<br />
<div style="text-align:center;"> Figure 1. '''Speech2Face model and training pipeline''' </div><br />
<br />
<br />
<br />
The Speech2Face Model used to achieve the desired result consists of two parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database [5] is used to extract a 4069-D face feature from the penultimate layer of the network. <br />
<br />
'''Voice Encoder Architecture''' <br />
<br />
[[File:VoiceEncoderArch.JPG|center]]<br />
<br />
<div style="text-align:center;"> Table 1: '''Voice encoder architecture''' </div><br />
<br />
<br />
<br />
The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time and allows the model to be used for input speeches of varying lengths. Two fully connected layers at the end are used to return a 4096-dimensional facial feature output.<br />
<br />
'''Training'''<br />
<br />
The AVSSpeech dataset, a large-scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair color, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.<br />
<br />
The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and then inputted to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096-dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.<br />
<br />
In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math> : <math> \mathbb{R}^{4096} \textrightarrow \mathbb{R}^{1000}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$<br />
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}(a)\text{log}p_{(i)}(b)$$ $$p_{(i)}(a) = \frac{\text{exp}(a_i/T)}{\sum_j \text{exp}(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.<br />
<br />
<center><br />
[[File:L1vsTotalLoss.png | 700px]]<br />
</center><br />
<br />
<div style="text-align:center;"> Figure 2: '''Qualitative results on the AVSpeech test set''' </div><br />
<br />
== Results ==<br />
<br />
'''Confusion Matrix and Dataset statistics'''<br />
<br />
<center><br />
[[File:Confusionmatrix.png| 600px]]<br />
</center><br />
<br />
<div style="text-align:center;"> Figure 3. '''Facial attribute evaluation''' </div><br />
<br />
<br />
<br />
In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face. <br />
<br />
'''Feature Similarity'''<br />
<br />
<center><br />
[[File:FeatSim.JPG]]<br />
</center><br />
<br />
<div style="text-align:center;"> Table 2. '''Feature similarity''' </div><br />
<br />
<br />
<br />
Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and the true facial feature vector from the face decoder were computed, and presented, above, in Table 2. A comparison of facial similarity was also done based on the length of audio input. From the table, it is evident that the 6-second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth. <br />
<br />
'''S2F -> Face retrieval performance'''<br />
<br />
<center><br />
[[File: Retrieval.JPG]]<br />
</center><br />
<br />
<div style="text-align:center;"> Table 3. '''S2F -> Face retrieval performance''' </div><br />
<br />
<br />
<br />
The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.<br />
<br />
'''Additional Observations''' <br />
<br />
Ablation studies were carried out to test the effect of audio duration and batch normalization. It was found that the duration of input audio during the training stage had little effect on convergence speed (comparing 3 and 6-second speech segments), while in the test stage longer input speech yields improvement in reconstruction quality. With respect to batch normalization (BN), it was found that without BN reconstructed faces would converge to an average face, while the inclusion of BN led to results which contained much richer facial features.<br />
<br />
== Conclusion ==<br />
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.<br />
<br />
== Discussion and Critiques ==<br />
<br />
There is evidence that the results of the model may be heavily influenced by external factors:<br />
<br />
1. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the model's prediction of ethnicity towards white. The bias in the results shows that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. <br />
<br />
2. The model was shown to infer different face features based on language. This puts into question how heavily the model depends on the spoken language. The paper mentioned the quality of face reconstruction may be affected by uncommon languages, where English is the most popular language on Youtube(training set). Testing a more controlled sample where all speech recording was of the same language may help address this concern to determine the model's reliance on spoken language.<br />
<br />
3. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender, and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face and may result in a compounding effect on the miss-classification rate. <br />
<br />
4. Figure 4.b shows the AVSpeech dataset statistics. However, it doesn't show the statistics about speakers' ethnicity and the language of the video. If we train the model with a more comprehensive dataset that includes enough Asian/Indian English speakers and native language speakers will this increase the accuracy?<br />
<br />
5. One concern about the source of the training data, i.e. the Youtube videos, is that resolution varies a lot since the videos are randomly selected. That may be the reason why the proposed model performs badly on some certain features. For example, it is hard to tell the age when the resolution is bad because the wrinkles on the face are neglected.<br />
<br />
6. The topic of this project is very interesting, but I highly doubt this model will be practical in real world problems. Because there are many factors to affect a person's sound in the real world environment. Sounds such like phone clock, TV, car horn and so on. These sounds will decrease the accuracy of the predicted result of the model.<br />
<br />
7. A lot of information can be obtained from someone's voice, this can potentially be useful for detective work and crime scene investigation. In our world of increasing surveillance, public voice recording is quite common and we can reconstruct images of potential suspects based on their voice. In order for this to be achieved, the model has to be thoroughly trained and tested to avoid false positives as it could have highly destructive outcome for a falsely convicted suspect.<br />
<br />
8. This is a very interesting topic, and this summary has a good structure for readers. Since this model uses Youtube to train model, but I think one problem is that most of the YouTubers are adult, and there are many additional reasons that make this dataset be highly unbalanced. What is more, some people may have a baby voice, this also could affect the performance of the model. But overall, this is a meaningful topic, it might help police to locate the suspects. So it might be interesting to apply this to the police.<br />
<br />
9. In addition, it seems very unlikely that any results coming from this model would ever be held in a regard even remotely close to being admissible in court to identify a person of interest until the results are improved and the model can be shown to work in real world applications. Otherwise, there seems to be very little use for such technology and it could have negative impacts on people if they were to be depicted in an unflattering way by the model based on their voice.<br />
<br />
10. Using voice as a factor of constructing the face is a good idea, but it seems like the data they have will have lots of noises and bias. The voice of a video might not come from the person in the video. There are so many Youtubers adjusting their voices before uploading their video and it's really hard to know whether they adjust their voice. Also, most Youtubers are adults so the model cannot have enough training samples about teenagers and kids.<br />
<br />
== References ==<br />
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In<br />
IEEE International Conference on Computer Vision (ICCV),<br />
2017.<br />
<br />
[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International<br />
Conference on Acoustics, Speech and Signal Processing<br />
(ICASSP), 2019.<br />
<br />
[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.<br />
<br />
[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.<br />
<br />
[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Research_Papers_Classification_System&diff=48833Research Papers Classification System2020-12-02T05:26:26Z<p>Y93fang: /* Term Frequency (TF) */</p>
<hr />
<div>= Presented by =<br />
Jill Wang, Junyi (Jay) Yang, Yu Min (Chris) Wu, Chun Kit (Calvin) Li<br />
<br />
= Introduction =<br />
This paper introduces a paper classification system that utilizes the Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and K-means clustering. The most important technology the system used to process big data is the Hadoop Distributed File Systems (HDFS). The system can handle quantitatively complex research paper classification problems efficiently and accurately.<br />
<br />
===General Framework===<br />
<br />
The paper classification system classifies research papers based on the abstracts given that the core of most papers is presented in the abstracts. <br />
<br />
<ol><li>Paper Crawling <br />
<p>Collects abstracts from research papers published during a given period</p></li><br />
<li>Preprocessing<br />
<p> <ol style="list-style-type:lower-alpha"><li>Removes stop words in the papers crawled, in which only nouns are extracted from the papers</li><br />
<li>generates a keyword dictionary, keeping only the top-N keywords with the highest frequencies</li> </ol><br />
</p></li> <br />
<li>Topic Modelling<br />
<p> Use the LDA to group the keywords into topics</p><br />
</li><br />
<li>Paper Length Calculation<br />
<p> Calculates the total number of occurrences of words to prevent an unbalanced TF values caused by the various length of abstracts using the map-reduce algorithm</p><br />
</li><br />
<li>Word Frequency Calculation<br />
<p> Calculates the Term Frequency (TF) values which represent the frequency of keywords in a research paper</p><br />
</li><br />
<li>Document Frequency Calculation<br />
<p> Calculates the Document Frequency (DF) values which represents the frequency of keywords in a collection of research papers. The higher the DF value, the lower the importance of a keyword.</p><br />
</li><br />
<li>TF-IDF calculation<br />
<p> Calculates the inverse of the DF which represents the importance of a keyword.</p><br />
</li><br />
<li>Paper Classification<br />
<p> Classify papers by topics using the K-means clustering algorithm.</p><br />
</li><br />
</ol><br />
<br />
===Technologies===<br />
<br />
The HDFS with a Hadoop cluster composed of one master node, one sub node, and four data nodes is what is used to process the massive paper data. Hadoop-2.6.5 version in Java is what is used to perform the TF-IDF calculation. Spark MLlib is what is used to perform the LDA. The Scikit-learn library is what is used to perform the K-means clustering.<br />
<br />
===HDFS===<br />
<br />
Hadoop Distributed File Systems was used to process big data in this system. What Hadoop does is to break a big collection of data into different partitions and pass each partition to one individual processor. Each processor will only have information about the partition of data it received.<br />
<br />
'''In this summary, we are going to focus on introducing the main algorithms of what this system uses, namely LDA, TF-IDF, and K-Means.'''<br />
<br />
=Data Preprocessing=<br />
===Crawling of Abstract Data===<br />
<br />
Under the assumption that audiences tend to first read the abstract of a paper to gain an overall understanding of the material, it is reasonable to assume the abstract section includes “core words” that can be used to effectively classify a paper's subject.<br />
<br />
An abstract is crawled to have its stop words removed. Stop words are words that are usually ignored by search engines, such as “the”, “a”, and etc. Afterwards, nouns are extracted, as a more condensed representation for efficient analysis.<br />
<br />
This is managed on HDFS. The TF-IDF value of each paper is calculated through map-reduce.<br />
<br />
===Managing Paper Data===<br />
<br />
To construct an effective keyword dictionary using abstract data and keywords data in all of the crawled papers, the authors categorized keywords with similar meanings using a single representative keyword. The approach is called stemming, which is common in cleaning data. 1394 keyword categories are extracted, which is still too much to compute. Hence, only the top 30 keyword categories are used.<br />
<br />
<div align="center">[[File:table_1_kswf.JPG|700px]]</div><br />
<br />
=Topic Modeling Using LDA=<br />
<br />
Latent Dirichlet allocation (LDA) is a generative probabilistic model that views documents as random mixtures over latent topics. Each topic is a distribution over words, and the goal is to extract these topics from documents.<br />
<br />
LDA estimates the topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> using Dirichlet priors for the distributions with a fixed number of topics. For each document, obtain a feature vector:<br />
<br />
\[F = \left( P\left(z_1 | d\right), P\left(z_2 | d\right), \cdots, P\left(z_k | d\right) \right)\]<br />
<br />
In the paper, authors extract topics from preprocessed paper to generate three kinds of topic sets, each with 10, 20, and 30 topics respectively. The following is a table of the 10 topic sets of highest frequency keywords.<br />
<br />
<div align="center">[[File:table_2_tswtebls.JPG|700px]]</div><br />
<br />
<br />
===LDA Intuition===<br />
<br />
LDA uses the Dirichlet priors of the Dirichlet distribution. The following picture illustrates 2-simplex Dirichlet distributions with different alpha values, one for each corner of the triangles. <br />
<br />
<div align="center">[[File:dirichlet_dist.png|700px]]</div><br />
<br />
Simplex is a generalization of the notion of a triangle. In Dirichlet distribution, each parameter will be represented by a corner in simplex, so adding additional parameters implies increasing the dimensions of simplex. As illustrated, when alphas are smaller than 1 the distribution is dense at the corners. When the alphas are greater than 1 the distribution is dense at the centers.<br />
<br />
The following illustration shows an example LDA with 3 topics, 4 words and 7 documents.<br />
<br />
<div align="center">[[File:LDA_example.png|800px]]</div><br />
<br />
In the left diagram, there are three topics, hence it is a 2-simplex. In the right diagram there are four words, hence it is a 3-simplex. LDA essentially adjusts parameters in Dirichlet distributions and multinomial distributions (represented by the points), such that, in the left diagram, all the yellow points representing documents and, in the right diagram, all the points representing topics, are as close to a corner as possible. In other words, LDA finds topics for documents and also finds words for topics. At the end topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> are produced.<br />
<br />
=Term Frequency Inverse Document Frequency (TF-IDF) Calculation=<br />
<br />
TF-IDF is widely used to evaluate the importance of a set of words in the fields of information retrieval and text mining. It is a combination of term frequency (TF) and inverse document frequency (IDF). The idea behind this combination is<br />
It evaluates the importance of a word within a document, and<br />
It evaluates the importance of the word among the collection of all documents<br />
<br />
The TF-IDF formula has the following form:<br />
<br />
\[TF-IDF_{i,j} = TF_{i,j} \times IDF_{i}\]<br />
<br />
where i stands for the <math>i^{th}</math> word and j stands for the <math>j^{th}</math> document.<br />
<br />
===Term Frequency (TF)===<br />
<br />
TF evaluates the percentage of a given word in a document. Thus, TF value indicates the importance of a word. The TF has a positive relation with the importance.<br />
<br />
In this paper, we only calculate TF for words in the keyword dictionary obtained. For a given keyword i, <math>TF_{i,j}</math> is the number of times word i appears in document j divided by the total number of words in document j.<br />
<br />
The formula for TF has the following form:<br />
<br />
\[TF_{i,j} = \frac{n_{i,j} }{\sum_k n_{k,j} }\]<br />
<br />
where i stands for the <math>i^{th}</math> word, j stands for the <math>j^{th}</math> document, <math>n_{i,j}</math> stands for the number of times words <math>t_i</math> appear in document <math>d_j</math> and <math>\sum_k n_{k,j} </math> stands for total number of occurence of words in document <math>d_j</math>.<br />
<br />
Note that the denominator is the total number of words remaining in document j after crawling.<br />
<br />
===Document Frequency (DF)===<br />
<br />
DF evaluates the percentage of documents that contain a given word over the entire collection of documents. Thus, the higher DF value is, the less important the word is. Since DF and the importance of the word have an inverse relation, we use IDF instead of DF.<br />
<br />
<math>DF_{i}</math> is the number of documents in the collection with word i divided by the total number of documents in the collection. The formula for DF has the following form:<br />
<br />
\[DF_{i} = \frac{|d_k \in D: n_{i,k} > 0|}{|D|}\]<br />
<br />
where <math>n_{i,k}</math> is the number of times word i appears in document k, |D| is the total number of documents in the collection.<br />
<br />
===Inverse Document Frequency (IDF)===<br />
<br />
In this paper, IDF is calculated in a log scale. Since we will receive a large number of documents, i.e, we will have a large |D|<br />
<br />
The formula for IDF has the following form:<br />
<br />
\[IDF_{i} = log\left(\frac{|D|}{|\{d_k \in D: n_{i,k} > 0\}|}\right)\]<br />
<br />
As mentioned before, we will use HDFS. The actual formula applied is:<br />
<br />
\[IDF_{i} = log\left(\frac{|D|+1}{|\{d_k \in D: n_{i,k} > 0\}|+1}\right)\]<br />
<br />
The inverse document frequency gives a measure of how rare a certain term is in a given document corpus.<br />
<br />
=Paper Classification Using K-means Clustering=<br />
<br />
The K-means clustering is an unsupervised classification algorithm that groups similar data into the same class. It is an efficient and simple method that can work with different types of data attributes and is able to handle noise and outliers.<br />
<br><br />
<br />
Given a set of <math>d</math> by <math>n</math> dataset <math>\mathbf{X} = \left[ \mathbf{x}_1 \cdots \mathbf{x}_n \right]</math>, the algorithm will assign each <math>\mathbf{x}_j</math> into <math>k</math> different clusters based on the characteristics of <math>\mathbf{x}_j</math> itself.<br />
<br><br />
<br />
Moreover, when assigning data into a cluster, the algorithm will also try to minimise the distances between the data and the centre of the cluster which the data belongs to. That is, k-means clustering will minimise the sum of square error:<br />
<br />
\begin{align*}<br />
min \sum_{i=1}^{k} \sum_{j \in C_i} ||x_j - \mu_i||^2<br />
\end{align*}<br />
<br />
where<br />
<ul><br />
<li><math>k</math>: the number of clusters</li><br />
<li><math>C_i</math>: the <math>i^th</math> cluster</li><br />
<li><math>x_j</math>: the <math>j^th</math> data in the <math>C_i</math></li><br />
<li><math>mu_i</math>: the centroid of <math>C_i</math></li><br />
<li><math>||x_j - \mu_i||^2</math>: the Euclidean distance between <math>x_j</math> and <math>\mu_i</math></li><br />
</ul><br />
<br><br />
<br />
Since the goal for this paper is to classify research papers and group papers with similar topics based on keywords, the paper uses the K-means clustering algorithm. The algorithm first computes the cluster centroid for each group of papers with a specific topic. Then, it will assign a paper into a cluster based on the Euclidean distance between the cluster centroid and the paper’s TF-IDF value.<br />
<br><br />
<br />
However, different values of <math>k</math> (the number of clusters) will return different clustering results. Therefore, it is important to define the number of clusters before clustering. For example, in this paper, the authors choose to use the Elbow scheme to determine the value of <math>k</math>. The Elbow scheme is a somewhat subjective way of choosing an optimal <math>k</math> that involves plotting the average of the squared distances from the cluster centers of the respective clusters (distortion) as a function of <math>k</math> and choosing a <math>k</math> at which point the decrease in distortion is outweighed by the increase in complexity. Also, to measure the performance of clustering, the authors decide to use the Silhouette scheme. The results of clustering are validated if the Silhouette scheme returns a value greater than <math>0.5</math>.<br />
<br />
=System Testing Results=<br />
<br />
In this paper, the dataset has 3264 research papers from the Future Generation Computer System (FGCS) journal between 1984 and 2017. For constructing keyword dictionaries for each paper, the authors have introduced three methods as shown below:<br />
<br />
<div align="center">[[File:table_3_tmtckd.JPG|700px]]</div><br />
<br />
<br />
Then, the authors use the Elbow scheme to define the number of clusters for each method with different numbers of keywords before running the K-means clustering algorithm. The results are shown below:<br />
<br />
<div align="center">[[File:table_4_nocobes.JPG|700px]]</div><br />
<br />
According to Table 4, there is a positive correlation between the number of keywords and the number of clusters. In addition, method 3 combines the advantages for both method 1 and method 2; thus, method 3 requires the least clusters in total. On the other hand, the wrong keywords might be presented in papers; hence, it might not be possible to group papers with similar subjects correctly by using method 1 and so method 1 needs the most number of clusters in total.<br />
<br />
<br />
Next, the Silhouette scheme had been used for measuring the performance for clustering. The average of the Silhouette values for each method with different numbers of keywords are shown below:<br />
<br />
<div align="center">[[File:table_5_asv.JPG|700px]]</div><br />
<br />
Since the clustering is validated if the Silhouette’s value is greater than 0.5, for methods with 10 and 30 keywords, the K-means clustering algorithm produces good results.<br />
<br />
<br />
To evaluate the accuracy of the classification system in this paper, the authors use the F-Score. The authors execute 5 times of experiment and use 500 randomly selected research papers for each trial. The following histogram shows the average value of F-Score for the three methods and different numbers of keywords:<br />
<br />
<div align="center">[[File:fig_16_fsvotm.JPG|700px]]</div><br />
<br />
Note that “TFIDF” means method 1, “LDA” means method 2, and “TFIDF-LDA” means method 3. The number 10, 20, and 30 after each method is the number of keywords the method has used.<br />
According to the histogram above, method 3 has the highest F-Score values than the other two methods with different numbers of keywords. Therefore, the classification system is most accurate when using method 3 as it combines the advantages for both method 1 and method 2.<br />
<br />
=Conclusion=<br />
<br />
This paper introduces a classification system that classifies papers into different topics by using TF-IDF and LDA scheme with K-means clustering algorithm. This system allows users to search the papers they want quickly and with the most productivity.<br />
<br />
Furthermore, this classification system might be also used in different types of texts (e.g. documents, tweets, etc.) instead of only classifying research papers.<br />
<br />
=Critique=<br />
<br />
In this paper, DF values are calculated within each partition. This results that for each partition, DF value for a given word will vary and may have an inconsistent result for different partition methods. As mentioned above, there might be a divide by zero problem since some partitions do not have documents containing a given word, but this can be solved by introducing a dummy document as the authors did. Another method that might be better at solving inconsistent results and the divide by zero problems is to have all partitions to communicate with their DF value. Then pass the merged DF value to all partitions to do the final IDF and TF-IDF value. Having all partitions to communicate with the DF value will guarantee a consistent DF value across all partitions and helps avoid a divide by zero problem as words in the keyword dictionary must appear in some documents in the whole collection.<br />
<br />
This paper treated the words in the different parts of a document equivalently, it might perform better if it gives different weights to the same word in different parts. For example, if a word appears in the title of the document, it usually shows it's a main topic of this document so we can put more weight on it to categorize.<br />
<br />
When discussing the potential processing advantages of this classification system for other types of text samples, has the effect of processing mixed samples (text and image or text and video) taken into consideration? IF not, in terms of text classification only, does it have an overwhelming advantage over traditional classification models?<br />
<br />
The preprocessing should also include <math>n</math>-gram tokenization for topic modelling because some topics are inherently two words, such as machine learning where if it is seen separately, it implies different topics.<br />
<br />
=References=<br />
<br />
Blei DM, el. (2003). Latent Dirichlet allocation. J Mach Learn Res 3:993–1022<br />
<br />
Gil, JM, Kim, SW. (2019). Research paper classification systems based on TF-IDF and LDA schemes. ''Human-centric Computing and Information Sciences'', 9, 30. https://doi.org/10.1186/s13673-019-0192-7<br />
<br />
Liu, S. (2019, January 11). Dirichlet distribution Motivating LDA. Retrieved November 2020, from https://towardsdatascience.com/dirichlet-distribution-a82ab942a879<br />
<br />
Serrano, L. (Director). (2020, March 18). Latent Dirichlet Allocation (Part 1 of 2) [Video file]. Retrieved 2020, from https://www.youtube.com/watch?v=T05t-SqKArY</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Research_Papers_Classification_System&diff=48832Research Papers Classification System2020-12-02T05:25:44Z<p>Y93fang: /* Term Frequency (TF) */</p>
<hr />
<div>= Presented by =<br />
Jill Wang, Junyi (Jay) Yang, Yu Min (Chris) Wu, Chun Kit (Calvin) Li<br />
<br />
= Introduction =<br />
This paper introduces a paper classification system that utilizes the Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and K-means clustering. The most important technology the system used to process big data is the Hadoop Distributed File Systems (HDFS). The system can handle quantitatively complex research paper classification problems efficiently and accurately.<br />
<br />
===General Framework===<br />
<br />
The paper classification system classifies research papers based on the abstracts given that the core of most papers is presented in the abstracts. <br />
<br />
<ol><li>Paper Crawling <br />
<p>Collects abstracts from research papers published during a given period</p></li><br />
<li>Preprocessing<br />
<p> <ol style="list-style-type:lower-alpha"><li>Removes stop words in the papers crawled, in which only nouns are extracted from the papers</li><br />
<li>generates a keyword dictionary, keeping only the top-N keywords with the highest frequencies</li> </ol><br />
</p></li> <br />
<li>Topic Modelling<br />
<p> Use the LDA to group the keywords into topics</p><br />
</li><br />
<li>Paper Length Calculation<br />
<p> Calculates the total number of occurrences of words to prevent an unbalanced TF values caused by the various length of abstracts using the map-reduce algorithm</p><br />
</li><br />
<li>Word Frequency Calculation<br />
<p> Calculates the Term Frequency (TF) values which represent the frequency of keywords in a research paper</p><br />
</li><br />
<li>Document Frequency Calculation<br />
<p> Calculates the Document Frequency (DF) values which represents the frequency of keywords in a collection of research papers. The higher the DF value, the lower the importance of a keyword.</p><br />
</li><br />
<li>TF-IDF calculation<br />
<p> Calculates the inverse of the DF which represents the importance of a keyword.</p><br />
</li><br />
<li>Paper Classification<br />
<p> Classify papers by topics using the K-means clustering algorithm.</p><br />
</li><br />
</ol><br />
<br />
===Technologies===<br />
<br />
The HDFS with a Hadoop cluster composed of one master node, one sub node, and four data nodes is what is used to process the massive paper data. Hadoop-2.6.5 version in Java is what is used to perform the TF-IDF calculation. Spark MLlib is what is used to perform the LDA. The Scikit-learn library is what is used to perform the K-means clustering.<br />
<br />
===HDFS===<br />
<br />
Hadoop Distributed File Systems was used to process big data in this system. What Hadoop does is to break a big collection of data into different partitions and pass each partition to one individual processor. Each processor will only have information about the partition of data it received.<br />
<br />
'''In this summary, we are going to focus on introducing the main algorithms of what this system uses, namely LDA, TF-IDF, and K-Means.'''<br />
<br />
=Data Preprocessing=<br />
===Crawling of Abstract Data===<br />
<br />
Under the assumption that audiences tend to first read the abstract of a paper to gain an overall understanding of the material, it is reasonable to assume the abstract section includes “core words” that can be used to effectively classify a paper's subject.<br />
<br />
An abstract is crawled to have its stop words removed. Stop words are words that are usually ignored by search engines, such as “the”, “a”, and etc. Afterwards, nouns are extracted, as a more condensed representation for efficient analysis.<br />
<br />
This is managed on HDFS. The TF-IDF value of each paper is calculated through map-reduce.<br />
<br />
===Managing Paper Data===<br />
<br />
To construct an effective keyword dictionary using abstract data and keywords data in all of the crawled papers, the authors categorized keywords with similar meanings using a single representative keyword. The approach is called stemming, which is common in cleaning data. 1394 keyword categories are extracted, which is still too much to compute. Hence, only the top 30 keyword categories are used.<br />
<br />
<div align="center">[[File:table_1_kswf.JPG|700px]]</div><br />
<br />
=Topic Modeling Using LDA=<br />
<br />
Latent Dirichlet allocation (LDA) is a generative probabilistic model that views documents as random mixtures over latent topics. Each topic is a distribution over words, and the goal is to extract these topics from documents.<br />
<br />
LDA estimates the topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> using Dirichlet priors for the distributions with a fixed number of topics. For each document, obtain a feature vector:<br />
<br />
\[F = \left( P\left(z_1 | d\right), P\left(z_2 | d\right), \cdots, P\left(z_k | d\right) \right)\]<br />
<br />
In the paper, authors extract topics from preprocessed paper to generate three kinds of topic sets, each with 10, 20, and 30 topics respectively. The following is a table of the 10 topic sets of highest frequency keywords.<br />
<br />
<div align="center">[[File:table_2_tswtebls.JPG|700px]]</div><br />
<br />
<br />
===LDA Intuition===<br />
<br />
LDA uses the Dirichlet priors of the Dirichlet distribution. The following picture illustrates 2-simplex Dirichlet distributions with different alpha values, one for each corner of the triangles. <br />
<br />
<div align="center">[[File:dirichlet_dist.png|700px]]</div><br />
<br />
Simplex is a generalization of the notion of a triangle. In Dirichlet distribution, each parameter will be represented by a corner in simplex, so adding additional parameters implies increasing the dimensions of simplex. As illustrated, when alphas are smaller than 1 the distribution is dense at the corners. When the alphas are greater than 1 the distribution is dense at the centers.<br />
<br />
The following illustration shows an example LDA with 3 topics, 4 words and 7 documents.<br />
<br />
<div align="center">[[File:LDA_example.png|800px]]</div><br />
<br />
In the left diagram, there are three topics, hence it is a 2-simplex. In the right diagram there are four words, hence it is a 3-simplex. LDA essentially adjusts parameters in Dirichlet distributions and multinomial distributions (represented by the points), such that, in the left diagram, all the yellow points representing documents and, in the right diagram, all the points representing topics, are as close to a corner as possible. In other words, LDA finds topics for documents and also finds words for topics. At the end topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> are produced.<br />
<br />
=Term Frequency Inverse Document Frequency (TF-IDF) Calculation=<br />
<br />
TF-IDF is widely used to evaluate the importance of a set of words in the fields of information retrieval and text mining. It is a combination of term frequency (TF) and inverse document frequency (IDF). The idea behind this combination is<br />
It evaluates the importance of a word within a document, and<br />
It evaluates the importance of the word among the collection of all documents<br />
<br />
The TF-IDF formula has the following form:<br />
<br />
\[TF-IDF_{i,j} = TF_{i,j} \times IDF_{i}\]<br />
<br />
where i stands for the <math>i^{th}</math> word and j stands for the <math>j^{th}</math> document.<br />
<br />
===Term Frequency (TF)===<br />
<br />
TF evaluates the percentage of a given word in a document. Thus, TF value indicates the importance of a word. The TF has a positive relation with the importance.<br />
<br />
In this paper, we only calculate TF for words in the keyword dictionary obtained. For a given keyword i, <math>TF_{i,j}</math> is the number of times word i appears in document j divided by the total number of words in document j.<br />
<br />
The formula for TF has the following form:<br />
<br />
\[TF_{i,j} = \frac{n_{i,j} }{\sum_k n_{k,j} }\]<br />
<br />
where i stands for the <math>i^{th}</math> word, j stands for the <math>j^{th}</math> document, and <math>n_{i,j}</math> stands for the number of times words <math>t_i</math> appear in document <math>d_j</math> and <math>\sum_k n_{k,j} </math> stands for total number of occurence of words in document <math>d_j</math>.<br />
<br />
Note that the denominator is the total number of words remaining in document j after crawling.<br />
<br />
===Document Frequency (DF)===<br />
<br />
DF evaluates the percentage of documents that contain a given word over the entire collection of documents. Thus, the higher DF value is, the less important the word is. Since DF and the importance of the word have an inverse relation, we use IDF instead of DF.<br />
<br />
<math>DF_{i}</math> is the number of documents in the collection with word i divided by the total number of documents in the collection. The formula for DF has the following form:<br />
<br />
\[DF_{i} = \frac{|d_k \in D: n_{i,k} > 0|}{|D|}\]<br />
<br />
where <math>n_{i,k}</math> is the number of times word i appears in document k, |D| is the total number of documents in the collection.<br />
<br />
===Inverse Document Frequency (IDF)===<br />
<br />
In this paper, IDF is calculated in a log scale. Since we will receive a large number of documents, i.e, we will have a large |D|<br />
<br />
The formula for IDF has the following form:<br />
<br />
\[IDF_{i} = log\left(\frac{|D|}{|\{d_k \in D: n_{i,k} > 0\}|}\right)\]<br />
<br />
As mentioned before, we will use HDFS. The actual formula applied is:<br />
<br />
\[IDF_{i} = log\left(\frac{|D|+1}{|\{d_k \in D: n_{i,k} > 0\}|+1}\right)\]<br />
<br />
The inverse document frequency gives a measure of how rare a certain term is in a given document corpus.<br />
<br />
=Paper Classification Using K-means Clustering=<br />
<br />
The K-means clustering is an unsupervised classification algorithm that groups similar data into the same class. It is an efficient and simple method that can work with different types of data attributes and is able to handle noise and outliers.<br />
<br><br />
<br />
Given a set of <math>d</math> by <math>n</math> dataset <math>\mathbf{X} = \left[ \mathbf{x}_1 \cdots \mathbf{x}_n \right]</math>, the algorithm will assign each <math>\mathbf{x}_j</math> into <math>k</math> different clusters based on the characteristics of <math>\mathbf{x}_j</math> itself.<br />
<br><br />
<br />
Moreover, when assigning data into a cluster, the algorithm will also try to minimise the distances between the data and the centre of the cluster which the data belongs to. That is, k-means clustering will minimise the sum of square error:<br />
<br />
\begin{align*}<br />
min \sum_{i=1}^{k} \sum_{j \in C_i} ||x_j - \mu_i||^2<br />
\end{align*}<br />
<br />
where<br />
<ul><br />
<li><math>k</math>: the number of clusters</li><br />
<li><math>C_i</math>: the <math>i^th</math> cluster</li><br />
<li><math>x_j</math>: the <math>j^th</math> data in the <math>C_i</math></li><br />
<li><math>mu_i</math>: the centroid of <math>C_i</math></li><br />
<li><math>||x_j - \mu_i||^2</math>: the Euclidean distance between <math>x_j</math> and <math>\mu_i</math></li><br />
</ul><br />
<br><br />
<br />
Since the goal for this paper is to classify research papers and group papers with similar topics based on keywords, the paper uses the K-means clustering algorithm. The algorithm first computes the cluster centroid for each group of papers with a specific topic. Then, it will assign a paper into a cluster based on the Euclidean distance between the cluster centroid and the paper’s TF-IDF value.<br />
<br><br />
<br />
However, different values of <math>k</math> (the number of clusters) will return different clustering results. Therefore, it is important to define the number of clusters before clustering. For example, in this paper, the authors choose to use the Elbow scheme to determine the value of <math>k</math>. The Elbow scheme is a somewhat subjective way of choosing an optimal <math>k</math> that involves plotting the average of the squared distances from the cluster centers of the respective clusters (distortion) as a function of <math>k</math> and choosing a <math>k</math> at which point the decrease in distortion is outweighed by the increase in complexity. Also, to measure the performance of clustering, the authors decide to use the Silhouette scheme. The results of clustering are validated if the Silhouette scheme returns a value greater than <math>0.5</math>.<br />
<br />
=System Testing Results=<br />
<br />
In this paper, the dataset has 3264 research papers from the Future Generation Computer System (FGCS) journal between 1984 and 2017. For constructing keyword dictionaries for each paper, the authors have introduced three methods as shown below:<br />
<br />
<div align="center">[[File:table_3_tmtckd.JPG|700px]]</div><br />
<br />
<br />
Then, the authors use the Elbow scheme to define the number of clusters for each method with different numbers of keywords before running the K-means clustering algorithm. The results are shown below:<br />
<br />
<div align="center">[[File:table_4_nocobes.JPG|700px]]</div><br />
<br />
According to Table 4, there is a positive correlation between the number of keywords and the number of clusters. In addition, method 3 combines the advantages for both method 1 and method 2; thus, method 3 requires the least clusters in total. On the other hand, the wrong keywords might be presented in papers; hence, it might not be possible to group papers with similar subjects correctly by using method 1 and so method 1 needs the most number of clusters in total.<br />
<br />
<br />
Next, the Silhouette scheme had been used for measuring the performance for clustering. The average of the Silhouette values for each method with different numbers of keywords are shown below:<br />
<br />
<div align="center">[[File:table_5_asv.JPG|700px]]</div><br />
<br />
Since the clustering is validated if the Silhouette’s value is greater than 0.5, for methods with 10 and 30 keywords, the K-means clustering algorithm produces good results.<br />
<br />
<br />
To evaluate the accuracy of the classification system in this paper, the authors use the F-Score. The authors execute 5 times of experiment and use 500 randomly selected research papers for each trial. The following histogram shows the average value of F-Score for the three methods and different numbers of keywords:<br />
<br />
<div align="center">[[File:fig_16_fsvotm.JPG|700px]]</div><br />
<br />
Note that “TFIDF” means method 1, “LDA” means method 2, and “TFIDF-LDA” means method 3. The number 10, 20, and 30 after each method is the number of keywords the method has used.<br />
According to the histogram above, method 3 has the highest F-Score values than the other two methods with different numbers of keywords. Therefore, the classification system is most accurate when using method 3 as it combines the advantages for both method 1 and method 2.<br />
<br />
=Conclusion=<br />
<br />
This paper introduces a classification system that classifies papers into different topics by using TF-IDF and LDA scheme with K-means clustering algorithm. This system allows users to search the papers they want quickly and with the most productivity.<br />
<br />
Furthermore, this classification system might be also used in different types of texts (e.g. documents, tweets, etc.) instead of only classifying research papers.<br />
<br />
=Critique=<br />
<br />
In this paper, DF values are calculated within each partition. This results that for each partition, DF value for a given word will vary and may have an inconsistent result for different partition methods. As mentioned above, there might be a divide by zero problem since some partitions do not have documents containing a given word, but this can be solved by introducing a dummy document as the authors did. Another method that might be better at solving inconsistent results and the divide by zero problems is to have all partitions to communicate with their DF value. Then pass the merged DF value to all partitions to do the final IDF and TF-IDF value. Having all partitions to communicate with the DF value will guarantee a consistent DF value across all partitions and helps avoid a divide by zero problem as words in the keyword dictionary must appear in some documents in the whole collection.<br />
<br />
This paper treated the words in the different parts of a document equivalently, it might perform better if it gives different weights to the same word in different parts. For example, if a word appears in the title of the document, it usually shows it's a main topic of this document so we can put more weight on it to categorize.<br />
<br />
When discussing the potential processing advantages of this classification system for other types of text samples, has the effect of processing mixed samples (text and image or text and video) taken into consideration? IF not, in terms of text classification only, does it have an overwhelming advantage over traditional classification models?<br />
<br />
The preprocessing should also include <math>n</math>-gram tokenization for topic modelling because some topics are inherently two words, such as machine learning where if it is seen separately, it implies different topics.<br />
<br />
=References=<br />
<br />
Blei DM, el. (2003). Latent Dirichlet allocation. J Mach Learn Res 3:993–1022<br />
<br />
Gil, JM, Kim, SW. (2019). Research paper classification systems based on TF-IDF and LDA schemes. ''Human-centric Computing and Information Sciences'', 9, 30. https://doi.org/10.1186/s13673-019-0192-7<br />
<br />
Liu, S. (2019, January 11). Dirichlet distribution Motivating LDA. Retrieved November 2020, from https://towardsdatascience.com/dirichlet-distribution-a82ab942a879<br />
<br />
Serrano, L. (Director). (2020, March 18). Latent Dirichlet Allocation (Part 1 of 2) [Video file]. Retrieved 2020, from https://www.youtube.com/watch?v=T05t-SqKArY</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN&diff=48827Mask RCNN2020-12-02T04:53:55Z<p>Y93fang: /* Model Architecture */</p>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision which is important when attempting to identify different objects within the same image. <br />
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection.<br />
<br />
== Visual Perception Tasks == <br />
<br />
Figure 1 shows a visual representation of different types of visual perception tasks:<br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Build on image classification but localize each object in an image<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture for Instance Segmentation.<br />
<br />
== Related Work == <br />
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.<br />
<br />
ROI Pooling: The main use of ROI Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section.<br />
<br />
Faster R-CNN: Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network, proposes candidate object bounding boxes. <br />
The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is actually a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of mask R-CNN is quite similar to the structure of faster R-CNN. <br />
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI as <math>L=L_{cls}+L_{box}+L_{mask}</math>, where <math>L_{cls}</math>, <math>L_{box}</math>, <math>L_{mask}</math> represent the classification loss, bounding box loss and the average binary cross-entropy loss respectively.<br />
<br />
The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, complete R-CNN pipeline stages involve: 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stages 3 and 4 are adjusted to simplify the network procedures.<br />
<br />
The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.<br />
One thing worth noticing is that for other network systems, those masks across classes compete with each other, but in this particular case, with a <br />
per-pixel sigmoid and a binary loss the masks across classes no longer compete, which makes this formula the key for good instance segmentation results.<br />
<br />
Another important concept involved is called RoIAlign. This concept is useful in stage 2 where the RoIPool extracts <br />
features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to pixel-to-pixel correspondence. The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. Also, instead of quantization, the coordinates are computed using bilinear interpolation to guarantee spatial correspondence.<br />
<br />
The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
There are some implementation details that should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of art model <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Human Pose Estimation ==<br />
Mask RCNN can be extended to human pose estimation.<br />
<br />
The simple approach the paper presents is to model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types such as left shoulder, right elbow. <br />
<br />
[[File:HumanPose.png | center]]<br />
<div align="center">Figure 11: Keypoint Detection Results</div><br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks the comparisons of different methods and Mask RNN on unlabeled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.<br />
<br />
The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.<br />
<br />
The Mask RCNN could be a candidate model to do short term prediction on physical behaviors of a person, which could be very useful at crime scenes.<br />
<br />
An interesting application of Mask RCNN would be on face recognition from CCTVs. Flurry pictures of crowded people could be obtained from CCTV, so that mask RCNN can be applied to distinguish each person.<br />
<br />
The main problems for CNN architectures like Mask RCNN is the running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live stream detections, where a faster running time would mean a better response to changes in frames. It would be beneficial to have a graphical representation of the Mask RCNN running times against single shot detector algorithms such as YOLOv3.<br />
<br />
It is interesting to investigate a solution of embedding instance segmentation with semantic segmentation in order to improve time performance. Because in many situations, knowing the exact boundary of an object is not necessary.<br />
<br />
It will be better if we can have more comparisons with other models. It will also be nice if we can have more details about why Mask RCNN can perform better, and how about the efficiency of it?<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN&diff=48826Mask RCNN2020-12-02T04:53:11Z<p>Y93fang: /* Model Architecture */</p>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision which is important when attempting to identify different objects within the same image. <br />
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection.<br />
<br />
== Visual Perception Tasks == <br />
<br />
Figure 1 shows a visual representation of different types of visual perception tasks:<br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Build on image classification but localize each object in an image<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture for Instance Segmentation.<br />
<br />
== Related Work == <br />
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.<br />
<br />
ROI Pooling: The main use of ROI Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section.<br />
<br />
Faster R-CNN: Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network, proposes candidate object bounding boxes. <br />
The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is actually a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of mask R-CNN is quite similar to the structure of faster R-CNN. <br />
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI as <math>L=L_{cls}+L_{box}+L_{mask}</math> where <math>L_{cls}</math>,<math>L_{box}</math>, <math>L_{mask}</math> represent the classification loss, bounding box loss and the average binary cross-entropy loss respectively.<br />
<br />
The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, complete R-CNN pipeline stages involve: 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stages 3 and 4 are adjusted to simplify the network procedures.<br />
<br />
The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.<br />
One thing worth noticing is that for other network systems, those masks across classes compete with each other, but in this particular case, with a <br />
per-pixel sigmoid and a binary loss the masks across classes no longer compete, which makes this formula the key for good instance segmentation results.<br />
<br />
Another important concept involved is called RoIAlign. This concept is useful in stage 2 where the RoIPool extracts <br />
features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to pixel-to-pixel correspondence. The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. Also, instead of quantization, the coordinates are computed using bilinear interpolation to guarantee spatial correspondence.<br />
<br />
The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
There are some implementation details that should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of art model <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Human Pose Estimation ==<br />
Mask RCNN can be extended to human pose estimation.<br />
<br />
The simple approach the paper presents is to model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types such as left shoulder, right elbow. <br />
<br />
[[File:HumanPose.png | center]]<br />
<div align="center">Figure 11: Keypoint Detection Results</div><br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks the comparisons of different methods and Mask RNN on unlabeled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.<br />
<br />
The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.<br />
<br />
The Mask RCNN could be a candidate model to do short term prediction on physical behaviors of a person, which could be very useful at crime scenes.<br />
<br />
An interesting application of Mask RCNN would be on face recognition from CCTVs. Flurry pictures of crowded people could be obtained from CCTV, so that mask RCNN can be applied to distinguish each person.<br />
<br />
The main problems for CNN architectures like Mask RCNN is the running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live stream detections, where a faster running time would mean a better response to changes in frames. It would be beneficial to have a graphical representation of the Mask RCNN running times against single shot detector algorithms such as YOLOv3.<br />
<br />
It is interesting to investigate a solution of embedding instance segmentation with semantic segmentation in order to improve time performance. Because in many situations, knowing the exact boundary of an object is not necessary.<br />
<br />
It will be better if we can have more comparisons with other models. It will also be nice if we can have more details about why Mask RCNN can perform better, and how about the efficiency of it?<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN&diff=48825Mask RCNN2020-12-02T04:48:02Z<p>Y93fang: /* Model Architecture */</p>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision which is important when attempting to identify different objects within the same image. <br />
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection.<br />
<br />
== Visual Perception Tasks == <br />
<br />
Figure 1 shows a visual representation of different types of visual perception tasks:<br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Build on image classification but localize each object in an image<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture for Instance Segmentation.<br />
<br />
== Related Work == <br />
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.<br />
<br />
ROI Pooling: The main use of ROI Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section.<br />
<br />
Faster R-CNN: Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network, proposes candidate object bounding boxes. <br />
The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is actually a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of mask R-CNN is quite similar to the structure of faster R-CNN. <br />
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI as <math>L_{cls}+L_{box}+L_{mask}</math>.<br />
<br />
The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, complete R-CNN pipeline stages involve: 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stages 3 and 4 are adjusted to simplify the network procedures.<br />
<br />
The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.<br />
One thing worth noticing is that for other network systems, those masks across classes compete with each other, but in this particular case, with a <br />
per-pixel sigmoid and a binary loss the masks across classes no longer compete, which makes this formula the key for good instance segmentation results.<br />
<br />
Another important concept involved is called RoIAlign. This concept is useful in stage 2 where the RoIPool extracts <br />
features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to pixel-to-pixel correspondence. The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. Also, instead of quantization, the coordinates are computed using bilinear interpolation to guarantee spatial correspondence.<br />
<br />
The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
There are some implementation details that should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of art model <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Human Pose Estimation ==<br />
Mask RCNN can be extended to human pose estimation.<br />
<br />
The simple approach the paper presents is to model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types such as left shoulder, right elbow. <br />
<br />
[[File:HumanPose.png | center]]<br />
<div align="center">Figure 11: Keypoint Detection Results</div><br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks the comparisons of different methods and Mask RNN on unlabeled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.<br />
<br />
The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.<br />
<br />
The Mask RCNN could be a candidate model to do short term prediction on physical behaviors of a person, which could be very useful at crime scenes.<br />
<br />
An interesting application of Mask RCNN would be on face recognition from CCTVs. Flurry pictures of crowded people could be obtained from CCTV, so that mask RCNN can be applied to distinguish each person.<br />
<br />
The main problems for CNN architectures like Mask RCNN is the running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live stream detections, where a faster running time would mean a better response to changes in frames. It would be beneficial to have a graphical representation of the Mask RCNN running times against single shot detector algorithms such as YOLOv3.<br />
<br />
It is interesting to investigate a solution of embedding instance segmentation with semantic segmentation in order to improve time performance. Because in many situations, knowing the exact boundary of an object is not necessary.<br />
<br />
It will be better if we can have more comparisons with other models. It will also be nice if we can have more details about why Mask RCNN can perform better, and how about the efficiency of it?<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN&diff=48824Mask RCNN2020-12-02T04:46:35Z<p>Y93fang: /* Model Architecture */</p>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision which is important when attempting to identify different objects within the same image. <br />
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection.<br />
<br />
== Visual Perception Tasks == <br />
<br />
Figure 1 shows a visual representation of different types of visual perception tasks:<br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Build on image classification but localize each object in an image<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture for Instance Segmentation.<br />
<br />
== Related Work == <br />
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.<br />
<br />
ROI Pooling: The main use of ROI Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section.<br />
<br />
Faster R-CNN: Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network, proposes candidate object bounding boxes. <br />
The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is actually a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of mask R-CNN is quite similar to the structure of faster R-CNN. <br />
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI as <math>L_{cls}+L_{box}+L_{mask}<\math>.<br />
<br />
The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, complete R-CNN pipeline stages involve: 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stages 3 and 4 are adjusted to simplify the network procedures.<br />
<br />
The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.<br />
One thing worth noticing is that for other network systems, those masks across classes compete with each other, but in this particular case, with a <br />
per-pixel sigmoid and a binary loss the masks across classes no longer compete, which makes this formula the key for good instance segmentation results.<br />
<br />
Another important concept involved is called RoIAlign. This concept is useful in stage 2 where the RoIPool extracts <br />
features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to pixel-to-pixel correspondence. The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. Also, instead of quantization, the coordinates are computed using bilinear interpolation to guarantee spatial correspondence.<br />
<br />
The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
There are some implementation details that should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of art model <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Human Pose Estimation ==<br />
Mask RCNN can be extended to human pose estimation.<br />
<br />
The simple approach the paper presents is to model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types such as left shoulder, right elbow. <br />
<br />
[[File:HumanPose.png | center]]<br />
<div align="center">Figure 11: Keypoint Detection Results</div><br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks the comparisons of different methods and Mask RNN on unlabeled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.<br />
<br />
The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.<br />
<br />
The Mask RCNN could be a candidate model to do short term prediction on physical behaviors of a person, which could be very useful at crime scenes.<br />
<br />
An interesting application of Mask RCNN would be on face recognition from CCTVs. Flurry pictures of crowded people could be obtained from CCTV, so that mask RCNN can be applied to distinguish each person.<br />
<br />
The main problems for CNN architectures like Mask RCNN is the running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live stream detections, where a faster running time would mean a better response to changes in frames. It would be beneficial to have a graphical representation of the Mask RCNN running times against single shot detector algorithms such as YOLOv3.<br />
<br />
It is interesting to investigate a solution of embedding instance segmentation with semantic segmentation in order to improve time performance. Because in many situations, knowing the exact boundary of an object is not necessary.<br />
<br />
It will be better if we can have more comparisons with other models. It will also be nice if we can have more details about why Mask RCNN can perform better, and how about the efficiency of it?<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN&diff=48816Mask RCNN2020-12-02T04:07:32Z<p>Y93fang: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision which is important when attempting to identify different objects within the same image. <br />
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection.<br />
<br />
== Visual Perception Tasks == <br />
<br />
Figure 1 shows a visual representation of different types of visual perception tasks:<br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Build on image classification but localize each object in an image<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture for Instance Segmentation.<br />
<br />
== Related Work == <br />
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.<br />
<br />
ROI Pooling: The main use of ROI Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section.<br />
<br />
Faster R-CNN: Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network, proposes candidate object bounding boxes. <br />
The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is actually a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of mask R-CNN is quite similar to the structure of faster R-CNN. <br />
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI.<br />
<br />
The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, complete R-CNN pipeline stages involve: 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stages 3 and 4 are adjusted to simplify the network procedures.<br />
<br />
The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.<br />
One thing worth noticing is that for other network systems, those masks across classes compete with each other, but in this particular case, with a <br />
per-pixel sigmoid and a binary loss the masks across classes no longer compete, which makes this formula the key for good instance segmentation results.<br />
<br />
Another important concept involved is called RoIAlign. This concept is useful in stage 2 where the RoIPool extracts <br />
features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to pixel-to-pixel correspondence. The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. Also, instead of quantization, the coordinates are computed using bilinear interpolation to guarantee spatial correspondence.<br />
<br />
The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
There are some implementation details that should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of art model <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Human Pose Estimation ==<br />
Mask RCNN can be extended to human pose estimation.<br />
<br />
The simple approach the paper presents is to model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types such as left shoulder, right elbow. <br />
<br />
[[File:HumanPose.png | center]]<br />
<div align="center">Figure 11: Keypoint Detection Results</div><br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks the comparisons of different methods and Mask RNN on unlabeled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.<br />
<br />
The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.<br />
<br />
The Mask RCNN could be a candidate model to do short term prediction on physical behaviors of a person, which could be very useful at crime scenes.<br />
<br />
An interesting application of Mask RCNN would be on face recognition from CCTVs. Flurry pictures of crowded people could be obtained from CCTV, so that mask RCNN can be applied to distinguish each person.<br />
<br />
The main problems for CNN architectures like Mask RCNN is the running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live stream detections, where a faster running time would mean a better response to changes in frames. It would be beneficial to have a graphical representation of the Mask RCNN running times against single shot detector algorithms such as YOLOv3.<br />
<br />
It is interesting to investigate a solution of embedding instance segmentation with semantic segmentation in order to improve time performance. Because in many situations, knowing the exact boundary of an object is not necessary.<br />
<br />
It will be better if we can have more comparisons with other models. It will also be nice if we can have more details about why Mask RCNN can perform better, and how about the efficiency of it?<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=46480Task Understanding from Confusing Multi-task Data2020-11-25T21:56:32Z<p>Y93fang: /* Description of the Problem */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The first function allocates identifies an input to its respective task and the latter finds the relationship between the input and its label. See figure 1(b). To train a network of CSL, CSL-Net is constructed for representing CSL’s variables. However, this structure cannot be optimized by gradient back-propagation. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization. <br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine sigh complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. In multi-task learning, the task to which every sample belongs is known. With this task definition, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. Without this annotation, our goal is to understand the task concept from confusing input-label pairs. Overall, It requires manual task annotation to learn and this paper is interested in machine learning without a clear task definition and manual task annotation.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one distribution. Thus, latent variable learning is insufficient to solve the research problem, which is multi-task confusing samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, let <math> (x,y)</math> be the training samples from <math>y=f(x)</math>, which is an identical but unknown mapping relationship. Assuming the risk measure is mean squared error (MSE), the expected risk functional is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the prior distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk functional can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk functional. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to the output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to recover both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of the MNIST digit data that has been colored. Each observation in this modified set consists of a colored image (<math>x_i</math>) and either the color, or the digit it represents (<math>y_i</math>). The goal is to recover the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': This data set has more observations than the "colored-MNIST" data and consists of pictures labeled with either the “Gender”, “Category”, and “Color” of the clothing item.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': In order to "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well learning the ground-truth. That means the initialization of the neural network is set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labeling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.<br />
<br />
Applications of multi-label classification include building a recommendation system, social media targeting, as well as detecting adverse drug reaction from text.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem with manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, there are some limitations that can be improved for future work:<br />
The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process; The current algorithm is difficult to learn basic features directly through the CNN structure, and it is necessary to learn and train a fully connected network based on CNN features in the experiment.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed for converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented in more examples, to test the robustness of different types of data.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=46479Task Understanding from Confusing Multi-task Data2020-11-25T21:56:13Z<p>Y93fang: /* Confusing Supervised Learning */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The first function allocates identifies an input to its respective task and the latter finds the relationship between the input and its label. See figure 1(b). To train a network of CSL, CSL-Net is constructed for representing CSL’s variables. However, this structure cannot be optimized by gradient back-propagation. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization. <br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine sigh complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. In multi-task learning, the task to which every sample belongs is known. With this task definition, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. Without this annotation, our goal is to understand the task concept from confusing input-label pairs. Overall, It requires manual task annotation to learn and this paper is interested in machine learning without a clear task definition and manual task annotation.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one distribution. Thus, latent variable learning is insufficient to solve the research problem, which is multi-task confusing samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, let <math> (x,y)<math> be the training samples from <math>y=f(x)</math>, which is an identical but unknown mapping relationship. Assuming the risk measure is mean squared error (MSE), the expected risk functional is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the prior distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk functional can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk functional. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to the output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to recover both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of the MNIST digit data that has been colored. Each observation in this modified set consists of a colored image (<math>x_i</math>) and either the color, or the digit it represents (<math>y_i</math>). The goal is to recover the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': This data set has more observations than the "colored-MNIST" data and consists of pictures labeled with either the “Gender”, “Category”, and “Color” of the clothing item.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': In order to "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well learning the ground-truth. That means the initialization of the neural network is set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labeling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.<br />
<br />
Applications of multi-label classification include building a recommendation system, social media targeting, as well as detecting adverse drug reaction from text.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem with manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, there are some limitations that can be improved for future work:<br />
The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process; The current algorithm is difficult to learn basic features directly through the CNN structure, and it is necessary to learn and train a fully connected network based on CNN features in the experiment.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed for converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented in more examples, to test the robustness of different types of data.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=46478Task Understanding from Confusing Multi-task Data2020-11-25T21:54:52Z<p>Y93fang: /* Description of the Problem */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The first function allocates identifies an input to its respective task and the latter finds the relationship between the input and its label. See figure 1(b). To train a network of CSL, CSL-Net is constructed for representing CSL’s variables. However, this structure cannot be optimized by gradient back-propagation. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization. <br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine sigh complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. In multi-task learning, the task to which every sample belongs is known. With this task definition, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. Without this annotation, our goal is to understand the task concept from confusing input-label pairs. Overall, It requires manual task annotation to learn and this paper is interested in machine learning without a clear task definition and manual task annotation.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one distribution. Thus, latent variable learning is insufficient to solve the research problem, which is multi-task confusing samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, letting (x,y) be the training samples from <math>y=f(x)</math>, an identical but unknown mapping relationship. Assuming the risk measure is mean squared error (MSE), the expected risk functional is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the prior distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk functional can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk functional. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to recover both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of the MNIST digit data that has been colored. Each observation in this modified set consists of a colored image (<math>x_i</math>) and either the color, or the digit it represents (<math>y_i</math>). The goal is to recover the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': This data set has more observations than the "colored-MNIST" data and consists of pictures labeled with either the “Gender”, “Category”, and “Color” of the clothing item.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': In order to "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well learning the ground-truth. That means the initialization of the neural network is set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labeling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.<br />
<br />
Applications of multi-label classification include building a recommendation system, social media targeting, as well as detecting adverse drug reaction from text.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem with manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, there are some limitations that can be improved for future work:<br />
The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process; The current algorithm is difficult to learn basic features directly through the CNN structure, and it is necessary to learn and train a fully connected network based on CNN features in the experiment.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed for converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented in more examples, to test the robustness of different types of data.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=46475Task Understanding from Confusing Multi-task Data2020-11-25T21:36:41Z<p>Y93fang: /* Introduction */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The first function allocates identifies an input to its respective task and the latter finds the relationship between the input and its label. See figure 1(b). To train a network of CSL, CSL-Net is constructed for representing CSL’s variables. However, this structure cannot be optimized by gradient back-propagation. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization. <br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine sigh complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. In multi-task learning, the task to which every sample belongs is known. With this task definition, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. Without this annotation, our goal is to understand the task concept from confusing input-label pairs. Overall, It requires manual task annotation to learn and this paper is interested in machine learning without a clear task definition and manual task annotation.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one distribution. Thus, latent variable learning is insufficient to solve the research problem, which is multi-task confusing samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, assuming the risk measure is mean squared error (MSE), the expected risk functional is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the prior distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk functional can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk functional. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to recover both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of the MNIST digit data that has been colored. Each observation in this modified set consists of a colored image (<math>x_i</math>) and either the color, or the digit it represents (<math>y_i</math>). The goal is to recover the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': This data set has more observations than the "colored-MNIST" data and consists of pictures labeled with either the “Gender”, “Category”, and “Color” of the clothing item.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': In order to "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well learning the ground-truth. That means the initialization of the neural network is set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labeling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.<br />
<br />
Applications of multi-label classification include building a recommendation system, social media targeting, as well as detecting adverse drug reaction from text.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem with manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, there are some limitations that can be improved for future work:<br />
The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process; The current algorithm is difficult to learn basic features directly through the CNN structure, and it is necessary to learn and train a fully connected network based on CNN features in the experiment.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed for converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented in more examples, to test the robustness of different types of data.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang&diff=46471User:T358wang2020-11-25T21:05:28Z<p>Y93fang: /* Methodology */</p>
<hr />
<div><br />
== Group ==<br />
Rui Chen, Zeren Shen, Zihao Guo, Taohao Wang<br />
<br />
== Introduction ==<br />
<br />
Landmark recognition is an image retrieval task with its own specific challenges. This paper provides a new and effective method to recognize landmark images, which has been successfully applied to actual images. In this way, statues, buildings, and characteristic objects can be effectively identified.<br />
<br />
There are many difficulties encountered in the development process:<br />
<br />
1. The first problem is that the concept of landmarks cannot be strictly defined, because landmarks can be any object and building.<br />
<br />
2. The second problem is that the same landmark can be photographed from different angles. The results of the multi-angle shooting will result in very different picture characteristics. But the system needs to accurately identify different landmarks. We may also need to consider angles that capture the interior of a building versus the exterior of it, a good model will be able to recognize both.<br />
<br />
3. The third problem is that the landmark recognition system must recognize a large number of landmarks, and the recognition must achieve high accuracy. The challenge here is that there are significantly more non-landmarks objects in the real world.<br />
<br />
These problems require that the system should have a very low false alarm rate and high recognition accuracy. <br />
There are also three potential problems:<br />
<br />
1. The processed data set contains a little error content, the image content is not clean and the quantity is huge.<br />
<br />
2. The algorithm for learning the training set must be fast and scalable.<br />
<br />
3. While displaying high-quality judgment landmarks, there is no image geographic information mixed.<br />
<br />
The article describes the deep convolutional neural network (CNN) architecture, loss function, training method, and inference aspects. Using this model, similar metrics to the state of the art model in the test were obtained and the inference time was found to be 15 times faster. Further, because of the efficient architecture, the system can serve in an online fashion. The results of quantitative experiments will be displayed through testing and deployment effect analysis to prove the effectiveness of the model.<br />
<br />
== Related Work ==<br />
<br />
Landmark recognition can be regarded as one of the tasks of image retrieval, and a large number of documents concentrate on image retrieval tasks. In the past two decades, the field of image retrieval has made significant progress, and the main methods can be divided into two categories. <br />
The first is a classic retrieval method using local features, a method based on local feature descriptors organized in bag-of-words(A bag of words model is defined as a simplified representation of the text information by retrieving only the significant words in a sentence or paragraph while disregarding its grammar. The bag of words approach is commonly used in classification tasks where the words are used as features in the model-training), spatial verification, Hamming embedding, and query expansion. These methods are dominant in image retrieval. Later, until the rise of deep convolutional neural networks (CNN), CNNs were used to generate global descriptors of input images.<br />
Another method is to selectively match the kernel Hamming embedding method extension. With the advent of deep convolutional neural networks, the most effective image retrieval method is based on training CNNs for specific tasks. Deep networks are very powerful for semantic feature representation, which allows us to effectively use them for landmark recognition. This method shows good results but brings additional memory and complexity costs. <br />
The DELF (DEep local feature) by Noh et al. proved promising results. This method combines the classic local feature method with deep learning. This allows us to extract local features from the input image and then use RANSAC for geometric verification. Random Sample Consensus (RANSAC) is a method to smooth data containing a significant percentage of errors, which is ideally suited for applications in automated image analysis where interpretation is based on the data generated by error-prone feature detectors. The goal of the project is to describe a method for accurate and fast large-scale landmark recognition using the advantages of deep convolutional neural networks.<br />
<br />
== Methodology ==<br />
<br />
This section will describe in detail the CNN architecture, loss function, training procedure, and inference implementation of the landmark recognition system. The figure below is an overview of the landmark recognition system.<br />
<br />
[[File:t358wang_landmark_recog_system.png |center|800px]]<br />
<br />
The landmark CNN consists of three parts including the main network, the embedding layer, and the classification layer. To obtain a CNN main network suitable for training landmark recognition model, fine-tuning is applied and several pre-trained backbones (Residual Networks) based on other similar datasets, including ResNet-50, ResNet-200, SE-ResNext-101, and Wide Residual Network (WRN-50-2), are evaluated based on inference quality and efficiency. Based on the evaluation results, WRN-50-2 is selected as the optimal backbone architecture. Fine-tuning is a very efficient technique in various computer vision applications because we can take advantage of everything the model has already learned and applied it to our specific task.<br />
<br />
[[File:t358wang_backbones.png |center|600px]]<br />
<br />
For the embedding layer, as shown in the below figure, the last fully-connected layer after the averaging pool is removed. Instead, a fully-connected 2048 <math>\times</math> 512 layer and a batch normalization are added as the embedding layer. After the batch norm, a fully-connected 512 <math>\times</math> n layer is added as the classification layer. The below figure shows the overview of the CNN architecture of the landmark recognition system.<br />
<br />
[[File:t358wang_network_arch.png |center|800px]]<br />
<br />
To effectively determine the embedding vectors for each landmark class (centroids), the network needs to be trained to have the members of each class to be as close as possible to the centroids. Several suitable loss functions are evaluated including Contrastive Loss, Arcface, and Center loss. The center loss is selected since it achieves the optimal test results and it trains a center of embeddings of each class and penalizes distances between image embeddings as well as their class centers. In addition, the center loss is a simple addition to softmax loss and is trivial to implement.<br />
<br />
When implementing the loss function, a new additional class that includes all non-landmark instances needs to be added and the center loss function needs to be modified as follows: Let n be the number of landmark classes, m be the mini-batch size, <math>x_i \in R^d</math> is the i-th embedding and <math>y_i</math> is the corresponding label where <math>y_i \in</math> {1,...,n,n+1}, n+1 is the label of the non-landmark class. Denote <math>W \in R^{d \times n}</math> as the weights of the classifier layer, <math>W_j</math> as its j-th column. Let <math>c_{y_i}</math> be the <math>y_i</math> th embeddings center from Center loss and <math>\lambda</math> be the balancing parameter of Center loss. Then the final loss function will be: <br />
<br />
[[File:t358wang_loss_function.png |center|600px]]<br />
<br />
In the training procedure, the stochastic gradient descent(SGD) will be used as the optimizer with momentum=0.9 and weight decay = 5e-3. For the center loss function, the parameter <math>\lambda</math> is set to 5e-5. Each image is resized to 256 <math>\times</math> 256 and several data augmentations are applied to the dataset including random resized crop, color jitter, and random flip. The training dataset is divided into four parts based on the geographical affiliation of cities where landmarks are located: Europe/Russia, North America/Australia/Oceania, Middle East/North Africa, and the Far East Regions. <br />
<br />
The paper introduces curriculum learning for landmark recognition, which is shown in the below figure. The algorithm is trained for 30 epochs and the learning rate <math>\alpha_1, \alpha_2, \alpha_3</math> will be reduced by a factor of 10 at the 12th epoch and 24th epoch.<br />
<br />
[[File:t358wang_algorithm1.png |center|600px]]<br />
<br />
In the inference phase, the paper introduces the term “centroids” which are embedding vectors that are calculated by averaging embeddings and are used to describe landmark classes. The calculation of centroids is significant to effectively determine whether a query image contains a landmark. The paper proposes two approaches to help the inference algorithm to calculate the centroids. First, instead of using the entire training data for each landmark, data cleaning is done to remove most of the redundant and irrelevant elements in the image. Second, since each landmark can have different shooting angles, it is more efficient to calculate a separate centroid for each shooting angle. Hence, a hierarchical agglomerative clustering algorithm is proposed to partition training data into several valid clusters for each landmark and the set of centroids for a landmark L can be represented by <math>\mu_{l_j} = \frac{1}{|C_j|} \sum_{i \in C_j} x_i, j \in 1,...,v</math> where v is the number of valid clusters for landmark L and v=1 if there is no valid clusters for L. <br />
<br />
Once the centroids are calculated for each landmark class, the system can make decisions whether there is any landmark in an image. The query image is passed through the landmark CNN and the resulting embedding vector is compared with all centroids by dot product similarity using approximate k-nearest neighbors (AKNN). To distinguish landmark classes from non-landmark, a threshold <math>\eta</math> is set and it will be compared with the maximum similarity to determine if the image contains any landmarks.<br />
<br />
The full inference algorithm is described in the below figure.<br />
<br />
[[File:t358wang_algorithm2.png |center|600px]]<br />
<br />
== Experiments and Analysis ==<br />
<br />
'''Offline test'''<br />
<br />
In order to measure the quality of the model, an offline test set was collected and manually labeled. According to the calculations, photos containing landmarks make up 1 − 3% of the total number of photos on average. This distribution was emulated in an offline test, and the geo-information and landmark references weren’t used. <br />
The results of this test are presented in the table below. Two metrics were used to measure the results of experiments: Sensitivity — the accuracy of a model on images with landmarks (also called Recall) and Specificity — the accuracy of a model on images without landmarks. Several types of DELF were evaluated, and the best results in terms of sensitivity and specificity were included in the table below. The table also contains the results of the model trained only with Softmax loss, Softmax, and Center loss. Thus, the table below reflects improvements in our approach with the addition of new elements in it.<br />
<br />
[[File:t358wang_models_eval.png |center|600px]]<br />
<br />
It’s very important to understand how a model works on “rare” landmarks due to the small amount of data for them. Therefore, the behavior of the model was examined separately on “rare” and “frequent” landmarks in the table below. The column “Part from total number” shows what percentage of landmark examples in the offline test has the corresponding type of landmarks. And we find that the sensitivity of “frequent” landmarks is much higher than “rare” landmarks.<br />
<br />
[[File:t358wang_rare_freq.png |center|600px]]<br />
<br />
Analysis of the behavior of the model in different categories of landmarks in the offline test is presented in the table below. These results show that the model can successfully work with various categories of landmarks. Predictably better results (92% of sensitivity and 99.5% of specificity) could also be obtained when the offline test with geo-information was launched on the model.<br />
<br />
[[File:t358wang_landmark_category.png |center|600px]]<br />
<br />
'''Revisited Paris dataset'''<br />
<br />
Revisited Paris dataset (RPar)[2] was also used to measure the quality of the landmark recognition approach. This dataset with Revisited Oxford (ROxf) is standard benchmarks for the comparison of image retrieval algorithms. In recognition, it is important to determine the landmark, which is contained in the query image. Images of the same landmark can have different shooting angles or taken inside/outside the building. Thus, it is reasonable to measure the quality of the model in the standard and adapt it to the task settings. That means not all classes from queries are presented in the landmark dataset. For those images containing correct landmarks but taken from different shooting angles within the building, we transferred them to the “junk” category, which does not influence the final score and makes the test markup closer to our model’s goal. Results on RPar with and without distractors in medium and hard modes are presented in the table below. <br />
<br />
[[File:t358wang_methods_eval1.png |center|600px]]<br />
<br />
[[File:t358wang_methods_eval2.png |center|600px]]<br />
<br />
== Comparison ==<br />
<br />
Recent most efficient approaches to landmark recognition are built on fine-tuned CNN. We chose to compare our method to DELF on how well each performs on recognition tasks. A brief summary is given below:<br />
<br />
[[File:t358wang_comparison.png |center|600px]]<br />
<br />
''' Offline test and timing '''<br />
<br />
Both approaches obtained similar results for image retrieval in the offline test (shown in the sensitivity&specificity table), but the proposed approach is much faster on the inference stage and more memory efficient.<br />
<br />
To be more detailed, during the inference stage, DELF needs more forward passes through CNN, has to search the entire database, and performs the RANSAC method for geometric verification. All of them make it much more time-consuming than our proposed approach. Our approach mainly uses centroids, this makes it take less time and needs to store fewer elements.<br />
<br />
== Conclusion ==<br />
<br />
In this paper we were hoping to solve some difficulties that emerge when trying to apply landmark recognition to the production level: there might not be a clean & sufficiently large database for interesting tasks, algorithms should be fast, scalable, and should aim for low FP and high accuracy.<br />
<br />
While aiming for these goals, we presented a way of cleaning landmark data. And most importantly, we introduced the usage of embeddings of deep CNN to make recognition fast and scalable, trained by curriculum learning techniques with modified versions of Center loss. Compared to the state-of-the-art methods, this approach shows similar results but is much faster and suitable for implementation on a large scale.<br />
<br />
== Critique ==<br />
The paper selected 5 images per landmark and checked them manually. That means the training process takes a long time on data cleaning and so the proposed algorithm lacks reusability. Also, since only the landmarks that are the largest and most popular were used to train the CNN, the trained model will probably be most useful in big cities instead of smaller cities with less popular landmarks.<br />
<br />
It might be worth looking into the ability to generalize better. <br />
<br />
This is a very interesting implementation in some specific field. The paper shows a process to analyze the problem and trains the model based on deep CNN implementation. In future work, it would be some practical advice to compare the deep CNN model with other models. By comparison, we might receive a more comprehensive training model for landmark recognization.<br />
<br />
== References ==<br />
[1] Andrei Boiarov and Eduard Tyantov. 2019. Large Scale Landmark Recognition via Deep Metric Learning. In The 28th ACM International Conference on Information and Knowledge Management (CIKM ’19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 10 pages. https://arxiv.org/pdf/1908.10192.pdf 3357384.3357956<br />
<br />
[2] FilipRadenović,AhmetIscen,GiorgosTolias,YannisAvrithis,andOndřejChum.<br />
2018. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking.<br />
arXiv preprint arXiv:1803.11285 (2018).</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang&diff=46470User:T358wang2020-11-25T21:03:29Z<p>Y93fang: /* Methodology */</p>
<hr />
<div><br />
== Group ==<br />
Rui Chen, Zeren Shen, Zihao Guo, Taohao Wang<br />
<br />
== Introduction ==<br />
<br />
Landmark recognition is an image retrieval task with its own specific challenges. This paper provides a new and effective method to recognize landmark images, which has been successfully applied to actual images. In this way, statues, buildings, and characteristic objects can be effectively identified.<br />
<br />
There are many difficulties encountered in the development process:<br />
<br />
1. The first problem is that the concept of landmarks cannot be strictly defined, because landmarks can be any object and building.<br />
<br />
2. The second problem is that the same landmark can be photographed from different angles. The results of the multi-angle shooting will result in very different picture characteristics. But the system needs to accurately identify different landmarks. We may also need to consider angles that capture the interior of a building versus the exterior of it, a good model will be able to recognize both.<br />
<br />
3. The third problem is that the landmark recognition system must recognize a large number of landmarks, and the recognition must achieve high accuracy. The challenge here is that there are significantly more non-landmarks objects in the real world.<br />
<br />
These problems require that the system should have a very low false alarm rate and high recognition accuracy. <br />
There are also three potential problems:<br />
<br />
1. The processed data set contains a little error content, the image content is not clean and the quantity is huge.<br />
<br />
2. The algorithm for learning the training set must be fast and scalable.<br />
<br />
3. While displaying high-quality judgment landmarks, there is no image geographic information mixed.<br />
<br />
The article describes the deep convolutional neural network (CNN) architecture, loss function, training method, and inference aspects. Using this model, similar metrics to the state of the art model in the test were obtained and the inference time was found to be 15 times faster. Further, because of the efficient architecture, the system can serve in an online fashion. The results of quantitative experiments will be displayed through testing and deployment effect analysis to prove the effectiveness of the model.<br />
<br />
== Related Work ==<br />
<br />
Landmark recognition can be regarded as one of the tasks of image retrieval, and a large number of documents concentrate on image retrieval tasks. In the past two decades, the field of image retrieval has made significant progress, and the main methods can be divided into two categories. <br />
The first is a classic retrieval method using local features, a method based on local feature descriptors organized in bag-of-words(A bag of words model is defined as a simplified representation of the text information by retrieving only the significant words in a sentence or paragraph while disregarding its grammar. The bag of words approach is commonly used in classification tasks where the words are used as features in the model-training), spatial verification, Hamming embedding, and query expansion. These methods are dominant in image retrieval. Later, until the rise of deep convolutional neural networks (CNN), CNNs were used to generate global descriptors of input images.<br />
Another method is to selectively match the kernel Hamming embedding method extension. With the advent of deep convolutional neural networks, the most effective image retrieval method is based on training CNNs for specific tasks. Deep networks are very powerful for semantic feature representation, which allows us to effectively use them for landmark recognition. This method shows good results but brings additional memory and complexity costs. <br />
The DELF (DEep local feature) by Noh et al. proved promising results. This method combines the classic local feature method with deep learning. This allows us to extract local features from the input image and then use RANSAC for geometric verification. Random Sample Consensus (RANSAC) is a method to smooth data containing a significant percentage of errors, which is ideally suited for applications in automated image analysis where interpretation is based on the data generated by error-prone feature detectors. The goal of the project is to describe a method for accurate and fast large-scale landmark recognition using the advantages of deep convolutional neural networks.<br />
<br />
== Methodology ==<br />
<br />
This section will describe in detail the CNN architecture, loss function, training procedure, and inference implementation of the landmark recognition system. The figure below is an overview of the landmark recognition system.<br />
<br />
[[File:t358wang_landmark_recog_system.png |center|800px]]<br />
<br />
The landmark CNN consists of three parts including the main network, the embedding layer, and the classification layer. To obtain a CNN main network suitable for training landmark recognition model, fine-tuning is applied and several pre-trained backbones (Residual Networks) based on other similar datasets, including ResNet-50, ResNet-200, SE-ResNext-101, and Wide Residual Network (WRN-50-2), are evaluated based on inference quality and efficiency. Based on the evaluation results, WRN-50-2 is selected as the optimal backbone architecture. Fine-tuning is a very efficient technique in various computer vision applications because we can take advantage of everything the model has already learned and applied it to our specific task.<br />
<br />
[[File:t358wang_backbones.png |center|600px]]<br />
<br />
For the embedding layer, as shown in the below figure, the last fully-connected layer after the averaging pool is removed. Instead, a fully-connected 2048 <math>\times</math> 512 layer and a batch normalization are added as the embedding layer. After the batch norm, a fully-connected 512 <math>\times</math> n layer is added as the classification layer. The below figure shows the overview of the CNN architecture of the landmark recognition system.<br />
<br />
[[File:t358wang_network_arch.png |center|800px]]<br />
<br />
To effectively determine the embedding vectors for each landmark class (centroids), the network needs to be trained to have the members of each class to be as close as possible to the centroids. Several suitable loss functions are evaluated including Contrastive Loss, Arcface, and Center loss. The center loss is selected since it achieves the optimal test results and it trains a center of embeddings of each class and penalizes distances between image embeddings as well as their class centers. In addition, the center loss is a simple addition to softmax loss and is trivial to implement.<br />
<br />
When implementing the loss function, a new additional class that includes all non-landmark instances needs to be added and the center loss function needs to be modified as follows: Let n be the number of landmark classes, m be the mini-batch size, <math>x_i \in R^d</math> is the i-th embedding and <math>y_i</math> is the corresponding label where <math>y_i \in</math> {1,...,n,n+1}, n+1 is the label of the non-landmark class. Denote <math>W \in R^{d \times n}</math> as the weights of the classifier layer, <math>W_j</math> as its j-th column. Let <math>c_{y_i}</math> be the <math>y_i</math> th embeddings center from Center loss and <math>\lambda</math> be the balancing parameter of Center loss. Then the final loss function will be: <br />
<br />
[[File:t358wang_loss_function.png |center|600px]]<br />
<br />
In the training procedure, the stochastic gradient descent(SGD) will be used as the optimizer with momentum=0.9 and weight decay = 5e-3. For the center loss function, the parameter <math>\lambda</math> is set to 5e-5. Each image is resized to 256 <math>\times</math> 256 and several data augmentations are applied to the dataset including random resized crop, color jitter, and random flip. The training dataset is divided into four parts based on the geographical affiliation of cities where landmarks are located: Europe/Russia, North America/Australia/Oceania, Middle East/North Africa, and the Far East Regions. <br />
<br />
The paper introduces curriculum learning for landmark recognition, which is shown in the below figure. The algorithm is trained for 30 epochs and the learning rate <math>\alpha_1, \alpha_2, \alpha_3</math> will be reduced by a factor of 10 at the 12th epoch and 24th epoch.<br />
<br />
[[File:t358wang_algorithm1.png |center|600px]]<br />
<br />
In the inference phase, the paper introduces the term “centroids” which are embedding vectors that are calculated by averaging embeddings and are used to describe landmark classes. The calculation of centroids is significant to effectively determine whether a query image contains a landmark. The paper proposes two approaches to help the inference algorithm to calculate the centroids. First, instead of using the entire training data for each landmark, data cleaning is done to remove most of the redundant and irrelevant elements in the image. Second, since each landmark can have different shooting angles, it is more efficient to calculate a separate centroid for each shooting angle. Hence, a hierarchical agglomerative clustering algorithm is proposed to partition training data into several valid clusters for each landmark and the set of centroids for a landmark L can be represented by <math>\mu_{l_j} = \frac{1}{|C_j|} \sum_{i \in C_j} x_i, C_j \in 1,...,v</math> where v is the number of valid clusters for landmark L and v=1 if there is no valid clusters for L. <br />
<br />
Once the centroids are calculated for each landmark class, the system can make decisions whether there is any landmark in an image. The query image is passed through the landmark CNN and the resulting embedding vector is compared with all centroids by dot product similarity using approximate k-nearest neighbors (AKNN). To distinguish landmark classes from non-landmark, a threshold <math>\eta</math> is set and it will be compared with the maximum similarity to determine if the image contains any landmarks.<br />
<br />
The full inference algorithm is described in the below figure.<br />
<br />
[[File:t358wang_algorithm2.png |center|600px]]<br />
<br />
== Experiments and Analysis ==<br />
<br />
'''Offline test'''<br />
<br />
In order to measure the quality of the model, an offline test set was collected and manually labeled. According to the calculations, photos containing landmarks make up 1 − 3% of the total number of photos on average. This distribution was emulated in an offline test, and the geo-information and landmark references weren’t used. <br />
The results of this test are presented in the table below. Two metrics were used to measure the results of experiments: Sensitivity — the accuracy of a model on images with landmarks (also called Recall) and Specificity — the accuracy of a model on images without landmarks. Several types of DELF were evaluated, and the best results in terms of sensitivity and specificity were included in the table below. The table also contains the results of the model trained only with Softmax loss, Softmax, and Center loss. Thus, the table below reflects improvements in our approach with the addition of new elements in it.<br />
<br />
[[File:t358wang_models_eval.png |center|600px]]<br />
<br />
It’s very important to understand how a model works on “rare” landmarks due to the small amount of data for them. Therefore, the behavior of the model was examined separately on “rare” and “frequent” landmarks in the table below. The column “Part from total number” shows what percentage of landmark examples in the offline test has the corresponding type of landmarks. And we find that the sensitivity of “frequent” landmarks is much higher than “rare” landmarks.<br />
<br />
[[File:t358wang_rare_freq.png |center|600px]]<br />
<br />
Analysis of the behavior of the model in different categories of landmarks in the offline test is presented in the table below. These results show that the model can successfully work with various categories of landmarks. Predictably better results (92% of sensitivity and 99.5% of specificity) could also be obtained when the offline test with geo-information was launched on the model.<br />
<br />
[[File:t358wang_landmark_category.png |center|600px]]<br />
<br />
'''Revisited Paris dataset'''<br />
<br />
Revisited Paris dataset (RPar)[2] was also used to measure the quality of the landmark recognition approach. This dataset with Revisited Oxford (ROxf) is standard benchmarks for the comparison of image retrieval algorithms. In recognition, it is important to determine the landmark, which is contained in the query image. Images of the same landmark can have different shooting angles or taken inside/outside the building. Thus, it is reasonable to measure the quality of the model in the standard and adapt it to the task settings. That means not all classes from queries are presented in the landmark dataset. For those images containing correct landmarks but taken from different shooting angles within the building, we transferred them to the “junk” category, which does not influence the final score and makes the test markup closer to our model’s goal. Results on RPar with and without distractors in medium and hard modes are presented in the table below. <br />
<br />
[[File:t358wang_methods_eval1.png |center|600px]]<br />
<br />
[[File:t358wang_methods_eval2.png |center|600px]]<br />
<br />
== Comparison ==<br />
<br />
Recent most efficient approaches to landmark recognition are built on fine-tuned CNN. We chose to compare our method to DELF on how well each performs on recognition tasks. A brief summary is given below:<br />
<br />
[[File:t358wang_comparison.png |center|600px]]<br />
<br />
''' Offline test and timing '''<br />
<br />
Both approaches obtained similar results for image retrieval in the offline test (shown in the sensitivity&specificity table), but the proposed approach is much faster on the inference stage and more memory efficient.<br />
<br />
To be more detailed, during the inference stage, DELF needs more forward passes through CNN, has to search the entire database, and performs the RANSAC method for geometric verification. All of them make it much more time-consuming than our proposed approach. Our approach mainly uses centroids, this makes it take less time and needs to store fewer elements.<br />
<br />
== Conclusion ==<br />
<br />
In this paper we were hoping to solve some difficulties that emerge when trying to apply landmark recognition to the production level: there might not be a clean & sufficiently large database for interesting tasks, algorithms should be fast, scalable, and should aim for low FP and high accuracy.<br />
<br />
While aiming for these goals, we presented a way of cleaning landmark data. And most importantly, we introduced the usage of embeddings of deep CNN to make recognition fast and scalable, trained by curriculum learning techniques with modified versions of Center loss. Compared to the state-of-the-art methods, this approach shows similar results but is much faster and suitable for implementation on a large scale.<br />
<br />
== Critique ==<br />
The paper selected 5 images per landmark and checked them manually. That means the training process takes a long time on data cleaning and so the proposed algorithm lacks reusability. Also, since only the landmarks that are the largest and most popular were used to train the CNN, the trained model will probably be most useful in big cities instead of smaller cities with less popular landmarks.<br />
<br />
It might be worth looking into the ability to generalize better. <br />
<br />
This is a very interesting implementation in some specific field. The paper shows a process to analyze the problem and trains the model based on deep CNN implementation. In future work, it would be some practical advice to compare the deep CNN model with other models. By comparison, we might receive a more comprehensive training model for landmark recognization.<br />
<br />
== References ==<br />
[1] Andrei Boiarov and Eduard Tyantov. 2019. Large Scale Landmark Recognition via Deep Metric Learning. In The 28th ACM International Conference on Information and Knowledge Management (CIKM ’19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 10 pages. https://arxiv.org/pdf/1908.10192.pdf 3357384.3357956<br />
<br />
[2] FilipRadenović,AhmetIscen,GiorgosTolias,YannisAvrithis,andOndřejChum.<br />
2018. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking.<br />
arXiv preprint arXiv:1803.11285 (2018).</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang&diff=46469User:T358wang2020-11-25T20:37:29Z<p>Y93fang: /* Related Work */</p>
<hr />
<div><br />
== Group ==<br />
Rui Chen, Zeren Shen, Zihao Guo, Taohao Wang<br />
<br />
== Introduction ==<br />
<br />
Landmark recognition is an image retrieval task with its own specific challenges. This paper provides a new and effective method to recognize landmark images, which has been successfully applied to actual images. In this way, statues, buildings, and characteristic objects can be effectively identified.<br />
<br />
There are many difficulties encountered in the development process:<br />
<br />
1. The first problem is that the concept of landmarks cannot be strictly defined, because landmarks can be any object and building.<br />
<br />
2. The second problem is that the same landmark can be photographed from different angles. The results of the multi-angle shooting will result in very different picture characteristics. But the system needs to accurately identify different landmarks. We may also need to consider angles that capture the interior of a building versus the exterior of it, a good model will be able to recognize both.<br />
<br />
3. The third problem is that the landmark recognition system must recognize a large number of landmarks, and the recognition must achieve high accuracy. The challenge here is that there are significantly more non-landmarks objects in the real world.<br />
<br />
These problems require that the system should have a very low false alarm rate and high recognition accuracy. <br />
There are also three potential problems:<br />
<br />
1. The processed data set contains a little error content, the image content is not clean and the quantity is huge.<br />
<br />
2. The algorithm for learning the training set must be fast and scalable.<br />
<br />
3. While displaying high-quality judgment landmarks, there is no image geographic information mixed.<br />
<br />
The article describes the deep convolutional neural network (CNN) architecture, loss function, training method, and inference aspects. Using this model, similar metrics to the state of the art model in the test were obtained and the inference time was found to be 15 times faster. Further, because of the efficient architecture, the system can serve in an online fashion. The results of quantitative experiments will be displayed through testing and deployment effect analysis to prove the effectiveness of the model.<br />
<br />
== Related Work ==<br />
<br />
Landmark recognition can be regarded as one of the tasks of image retrieval, and a large number of documents concentrate on image retrieval tasks. In the past two decades, the field of image retrieval has made significant progress, and the main methods can be divided into two categories. <br />
The first is a classic retrieval method using local features, a method based on local feature descriptors organized in bag-of-words(A bag of words model is defined as a simplified representation of the text information by retrieving only the significant words in a sentence or paragraph while disregarding its grammar. The bag of words approach is commonly used in classification tasks where the words are used as features in the model-training), spatial verification, Hamming embedding, and query expansion. These methods are dominant in image retrieval. Later, until the rise of deep convolutional neural networks (CNN), CNNs were used to generate global descriptors of input images.<br />
Another method is to selectively match the kernel Hamming embedding method extension. With the advent of deep convolutional neural networks, the most effective image retrieval method is based on training CNNs for specific tasks. Deep networks are very powerful for semantic feature representation, which allows us to effectively use them for landmark recognition. This method shows good results but brings additional memory and complexity costs. <br />
The DELF (DEep local feature) by Noh et al. proved promising results. This method combines the classic local feature method with deep learning. This allows us to extract local features from the input image and then use RANSAC for geometric verification. Random Sample Consensus (RANSAC) is a method to smooth data containing a significant percentage of errors, which is ideally suited for applications in automated image analysis where interpretation is based on the data generated by error-prone feature detectors. The goal of the project is to describe a method for accurate and fast large-scale landmark recognition using the advantages of deep convolutional neural networks.<br />
<br />
== Methodology ==<br />
<br />
This section will describe in detail the CNN architecture, loss function, training procedure, and inference implementation of the landmark recognition system. The figure below is an overview of the landmark recognition system.<br />
<br />
[[File:t358wang_landmark_recog_system.png |center|800px]]<br />
<br />
The landmark CNN consists of three parts including the main network, the embedding layer, and the classification layer. To obtain a CNN main network suitable for training landmark recognition model, fine-tuning is applied and several pre-trained backbones (Residual Networks) based on other similar datasets, including ResNet-50, ResNet-200, SE-ResNext-101, and Wide Residual Network (WRN-50-2), are evaluated based on inference quality and efficiency. Based on the evaluation results, WRN-50-2 is selected as the optimal backbone architecture. Fine-tuning is a very efficient technique in various computer vision applications because we can take advantage of everything the model has already learned and applied it to our specific task.<br />
<br />
[[File:t358wang_backbones.png |center|600px]]<br />
<br />
For the embedding layer, as shown in the below figure, the last fully-connected layer after the averaging pool is removed. Instead, a fully-connected 2048 <math>\times</math> 512 layer and a batch normalization are added as the embedding layer. After the batch norm, a fully-connected 512 <math>\times</math> n layer is added as the classification layer. The below figure shows the overview of the CNN architecture of the landmark recognition system.<br />
<br />
[[File:t358wang_network_arch.png |center|800px]]<br />
<br />
To effectively determine the embedding vectors for each landmark class (centroids), the network needs to be trained to have the members of each class to be as close as possible to the centroids. Several suitable loss functions are evaluated including Contrastive Loss, Arcface, and Center loss. The center loss is selected since it achieves the optimal test results and it trains a center of embeddings of each class and penalizes distances between image embeddings as well as their class centers. In addition, the center loss is a simple addition to softmax loss and is trivial to implement.<br />
<br />
When implementing the loss function, a new additional class that includes all non-landmark instances needs to be added and the center loss function needs to be modified as follows: Let n be the number of landmark classes, m be the mini-batch size, <math>x_i \in R^d</math> is the i-th embedding and <math>y_i</math> is the corresponding label where <math>y_i \in</math> {1,...,n,n+1}, n+1 is the label of the non-landmark class. Denote <math>W \in R^{d \times n}</math> as the weights of the classifier layer, <math>W_j</math> as its j-th column. Let <math>c_{y_i}</math> be the <math>y_i</math> th embeddings center from Center loss and <math>\lambda</math> be the balancing parameter of Center loss. Then the final loss function will be: <br />
<br />
[[File:t358wang_loss_function.png |center|600px]]<br />
<br />
In the training procedure, the stochastic gradient descent(SGD) will be used as the optimizer with momentum=0.9 and weight decay = 5e-3. For the center loss function, the parameter <math>\lambda</math> is set to 5e-5. Each image is resized to 256 <math>\times</math> 256 and several data augmentations are applied to the dataset including random resized crop, color jitter, and random flip. The training dataset is divided into four parts based on the geographical affiliation of cities where landmarks are located: Europe/Russia, North America/Australia/Oceania, Middle East/North Africa, and the Far East Regions. <br />
<br />
The paper introduces curriculum learning for landmark recognition, which is shown in the below figure. The algorithm is trained for 30 epochs and the learning rate <math>\alpha_1, \alpha_2, \alpha_3</math> will be reduced by a factor of 10 at the 12th epoch and 24th epoch.<br />
<br />
[[File:t358wang_algorithm1.png |center|600px]]<br />
<br />
In the inference phase, the paper introduces the term “centroids” which are embedding vectors that are calculated by averaging embeddings and are used to describe landmark classes. The calculation of centroids is significant to effectively determine whether a query image contains a landmark. The paper proposes two approaches to help the inference algorithm to calculate the centroids. First, instead of using the entire training data for each landmark, data cleaning is done to remove most of the redundant and irrelevant elements in the image. Second, since each landmark can have different shooting angles, it is more efficient to calculate a separate centroid for each shooting angle. Hence, a hierarchical agglomerative clustering algorithm is proposed to partition training data into several valid clusters for each landmark and the set of centroids for a landmark L can be represented by <math>\mu_{l_j} = \frac{1}{v} \sum_{i \in C_j} x_i, j \in 1,...,v</math> where v is the number of valid clusters for landmark L. <br />
<br />
Once the centroids are calculated for each landmark class, the system can make decisions whether there is any landmark in an image. The query image is passed through the landmark CNN and the resulting embedding vector is compared with all centroids by dot product similarity using approximate k-nearest neighbors (AKNN). To distinguish landmark classes from non-landmark, a threshold <math>\eta</math> is set and it will be compared with the maximum similarity to determine if the image contains any landmarks.<br />
<br />
The full inference algorithm is described in the below figure.<br />
<br />
[[File:t358wang_algorithm2.png |center|600px]]<br />
<br />
== Experiments and Analysis ==<br />
<br />
'''Offline test'''<br />
<br />
In order to measure the quality of the model, an offline test set was collected and manually labeled. According to the calculations, photos containing landmarks make up 1 − 3% of the total number of photos on average. This distribution was emulated in an offline test, and the geo-information and landmark references weren’t used. <br />
The results of this test are presented in the table below. Two metrics were used to measure the results of experiments: Sensitivity — the accuracy of a model on images with landmarks (also called Recall) and Specificity — the accuracy of a model on images without landmarks. Several types of DELF were evaluated, and the best results in terms of sensitivity and specificity were included in the table below. The table also contains the results of the model trained only with Softmax loss, Softmax, and Center loss. Thus, the table below reflects improvements in our approach with the addition of new elements in it.<br />
<br />
[[File:t358wang_models_eval.png |center|600px]]<br />
<br />
It’s very important to understand how a model works on “rare” landmarks due to the small amount of data for them. Therefore, the behavior of the model was examined separately on “rare” and “frequent” landmarks in the table below. The column “Part from total number” shows what percentage of landmark examples in the offline test has the corresponding type of landmarks. And we find that the sensitivity of “frequent” landmarks is much higher than “rare” landmarks.<br />
<br />
[[File:t358wang_rare_freq.png |center|600px]]<br />
<br />
Analysis of the behavior of the model in different categories of landmarks in the offline test is presented in the table below. These results show that the model can successfully work with various categories of landmarks. Predictably better results (92% of sensitivity and 99.5% of specificity) could also be obtained when the offline test with geo-information was launched on the model.<br />
<br />
[[File:t358wang_landmark_category.png |center|600px]]<br />
<br />
'''Revisited Paris dataset'''<br />
<br />
Revisited Paris dataset (RPar)[2] was also used to measure the quality of the landmark recognition approach. This dataset with Revisited Oxford (ROxf) is standard benchmarks for the comparison of image retrieval algorithms. In recognition, it is important to determine the landmark, which is contained in the query image. Images of the same landmark can have different shooting angles or taken inside/outside the building. Thus, it is reasonable to measure the quality of the model in the standard and adapt it to the task settings. That means not all classes from queries are presented in the landmark dataset. For those images containing correct landmarks but taken from different shooting angles within the building, we transferred them to the “junk” category, which does not influence the final score and makes the test markup closer to our model’s goal. Results on RPar with and without distractors in medium and hard modes are presented in the table below. <br />
<br />
[[File:t358wang_methods_eval1.png |center|600px]]<br />
<br />
[[File:t358wang_methods_eval2.png |center|600px]]<br />
<br />
== Comparison ==<br />
<br />
Recent most efficient approaches to landmark recognition are built on fine-tuned CNN. We chose to compare our method to DELF on how well each performs on recognition tasks. A brief summary is given below:<br />
<br />
[[File:t358wang_comparison.png |center|600px]]<br />
<br />
''' Offline test and timing '''<br />
<br />
Both approaches obtained similar results for image retrieval in the offline test (shown in the sensitivity&specificity table), but the proposed approach is much faster on the inference stage and more memory efficient.<br />
<br />
To be more detailed, during the inference stage, DELF needs more forward passes through CNN, has to search the entire database, and performs the RANSAC method for geometric verification. All of them make it much more time-consuming than our proposed approach. Our approach mainly uses centroids, this makes it take less time and needs to store fewer elements.<br />
<br />
== Conclusion ==<br />
<br />
In this paper we were hoping to solve some difficulties that emerge when trying to apply landmark recognition to the production level: there might not be a clean & sufficiently large database for interesting tasks, algorithms should be fast, scalable, and should aim for low FP and high accuracy.<br />
<br />
While aiming for these goals, we presented a way of cleaning landmark data. And most importantly, we introduced the usage of embeddings of deep CNN to make recognition fast and scalable, trained by curriculum learning techniques with modified versions of Center loss. Compared to the state-of-the-art methods, this approach shows similar results but is much faster and suitable for implementation on a large scale.<br />
<br />
== Critique ==<br />
The paper selected 5 images per landmark and checked them manually. That means the training process takes a long time on data cleaning and so the proposed algorithm lacks reusability. Also, since only the landmarks that are the largest and most popular were used to train the CNN, the trained model will probably be most useful in big cities instead of smaller cities with less popular landmarks.<br />
<br />
It might be worth looking into the ability to generalize better. <br />
<br />
This is a very interesting implementation in some specific field. The paper shows a process to analyze the problem and trains the model based on deep CNN implementation. In future work, it would be some practical advice to compare the deep CNN model with other models. By comparison, we might receive a more comprehensive training model for landmark recognization.<br />
<br />
== References ==<br />
[1] Andrei Boiarov and Eduard Tyantov. 2019. Large Scale Landmark Recognition via Deep Metric Learning. In The 28th ACM International Conference on Information and Knowledge Management (CIKM ’19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 10 pages. https://arxiv.org/pdf/1908.10192.pdf 3357384.3357956<br />
<br />
[2] FilipRadenović,AhmetIscen,GiorgosTolias,YannisAvrithis,andOndřejChum.<br />
2018. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking.<br />
arXiv preprint arXiv:1803.11285 (2018).</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou&diff=46461User:J46hou2020-11-25T18:57:47Z<p>Y93fang: /* Introduction */</p>
<hr />
<div>DROCC: Deep Robust One-Class Classification<br />
== Presented by == <br />
Jinjiang Lian, Yisheng Zhu, Jiawen Hou, Mingzhe Huang<br />
== Introduction ==<br />
In this work, we study “one-class” classification, where the goal is to obtain accurate discriminators for a special class. Popular uses of this technique include anomaly detection where we are interested in detecting outliers. Anomaly detection is a well-studied area of research, however, the conventional approach of modeling with typical data using a simple function falls short when it comes to complex domains such as vision or speech. Another case where this would be useful is when recognizing “wake-word” while waking up AI systems such as Alexa. <br />
<br />
Deep learning based on anomaly detection methods attempts to learn features automatically but has some limitations. One approach is based on extending the classical data modeling techniques over the learned representations, but in this case, all the points may be mapped to a single point which makes the layer looks "perfect". The second approach is based on learning the salient geometric structure of data and training the discriminator to predict applied transformation. The result could be considered anomalous if the discriminator fails to predict the transform accurately.<br />
<br />
Thus, in this paper, we are presenting a new approach called Deep Robust One-Class Classification (DROCC) to solve the above concerns. DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear low dimensional manifold. More specifically, we are presenting DROCC-LF which is an outlier-exposure style extension of DROCC. This extension combines the DROCC's anomaly detection loss with standard classification loss over the negative data and exploits the negative examples to learn a Mahalanobis distance.<br />
<br />
== Previous Work ==<br />
Traditional approaches for one-class problems include one-class SVM (Scholkopf et al., 1999) and Isolation Forest (Liu et al., 2008)[9]. One drawback of these approaches is that they involve careful feature engineering when applied to structured domains like images. The current state of art methodology to tackle this kind of problems are: <br />
<br />
1. Approach based on prediction transformations (Golan & El-Yaniv, 2018; Hendrycks et al.,2019a) [1] This approach has some shortcomings in the sense that it depends heavily on an appropriate domain-specific set of transformations that are in general hard to obtain. <br />
<br />
2. Approach of minimizing a classical one-class loss on the learned final layer representations such as DeepSVDD. (Ruff et al.,2018)[2] This method suffers from the fundamental drawback of representation collapse where the model is no longer being able to accurately recognize the feature representations.<br />
<br />
3. Approach based on balancing unbalanced training data sets using methods such as SMOTE to synthetically create outlier data to train models on.<br />
<br />
== Motivation ==<br />
Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009) [3]. The goal is to identify the outliers - the points not following a typical distribution. <br />
[[File:abnormal.jpeg | thumb | center | 1000px | Abnormal Data (Data Driven Investor, 2020)]]<br />
Classical approaches for anomaly detection are based on modeling the typical data using simple functions over the low-dimensional subspace or a tree-structured partition of the input space to detect anomalies (Sch¨olkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004) [4], such as constructing a minimum-enclosing ball around the typical data points (Tax & Duin, 2004) [5]. They broadly fall into three categories: AD via generative modeling, Deep Once Class SVM, Transformations based methods. While these techniques are well-suited when the input is featured appropriately, they struggle on complex domains like vision and speech, where hand-designing features are difficult.<br />
<br />
Another related problem is the one-class classification under limited negatives (OCLN). In this case, only a few negative samples are available. The goal is to find a classifier that would not misfire close negatives so that the false positive rate will be low. <br />
<br />
DROCC is robust to representation collapse by involving a discriminative component that is general and empirically accurate on most standard domains like tabular, time-series and vision without requiring any additional side information. DROCC is motivated by the key observation that generally, the typical data lies on a low-dimensional manifold, which is well-sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). [6]<br />
<br />
== Model Explanation ==<br />
[[File:drocc_f1.jpg | center]]<br />
<div align="center">Figure 1</div><br />
(a): A normal data manifold with red dots representing generated anomalous points in Ni(r). <br />
<br />
(b): Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. <br />
<br />
(c), (d): First two dimensions of the decision boundary of DROCC and DROCC–LF, when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal while DROCC’s decision boundary is inaccurate. Yellow color sine wave depicts the train data.<br />
<br />
== DROCC ==<br />
The model is based on the assumption that the true data lies on a manifold. As manifolds resemble Euclidean space locally, our discriminative component is based on classifying a point as anomalous if it is outside the union of small L2 norm balls around the training typical points (See Figure 1a, 1b for an illustration). Importantly, the above definition allows us to synthetically generate anomalous points, and we adaptively generate the most effective anomalous points while training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase to adaptively add anomalous points to our training set and a gradient descent phase to minimize the classification loss by learning a representation and a classifier on top of the representations to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to a representation collapse as mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.<br />
<br />
== DROCC-LF ==<br />
To especially tackle problems such as anomaly detection and outlier exposure (Hendrycks et al., 2019a) [7], we propose DROCC–LF, an outlier-exposure style extension of DROCC. Intuitively, DROCC–LF combines DROCC’s anomaly detection loss (that is over only the positive data points) with standard classification loss over the negative data. In addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance to compare points over the manifold instead of using the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively fewer samples. (See Figure 1c, 1d for illustration)<br />
<br />
== Popular Dataset Benchmark Result ==<br />
<br />
[[File:drocc_auc.jpg | center]]<br />
<div align="center">Figure 2: AUC result</div><br />
<br />
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. The average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10 is shown in table 1. DROCC outperforms baselines on most classes, with gains as high at 20%, and notably, nearest neighbors (NN) beats all the baselines on 2 classes.<br />
<br />
[[File:drocc_f1score.jpg | center]]<br />
<div align="center">Figure 3: F1-Score</div><br />
<br />
Table 2 shows F1-Score (with standard deviation) for one-vs-all anomaly detection on Thyroid, Arrhythmia, and Abalone datasets from UCI Machine Learning Repository. DROCC outperforms the baselines on all three datasets by a minimum of 0.07 which is about 11.5% performance increase.<br />
Results on One-class Classification with Limited Negatives (OCLN): <br />
[[File:ocln.jpg | center]]<br />
<div align="center">Figure 4: Sample postives, negatives and close negatives for MNIST digit 0 vs 1 experiment (OCLN).</div><br />
MNIST 0 vs. 1 Classification: <br />
We consider an experimental setup on MNIST dataset, where the training data consists of Digit 0, the normal class, and Digit 1 as the anomaly. During the evaluation, in addition to samples from training distribution, we also have half zeros, which act as challenging OOD points (close negatives). These half zeros are generated by randomly masking 50% of the pixels (Figure 2). BCE performs poorly, with a recall of 54% only at a fixed FPR of 3%. DROCC–OE gives a recall value of 98:16% outperforming DeepSAD by a margin of 7%, which gives a recall value of 90:91%. DROCC–LF provides further improvement with a recall of 99:4% at 3% FPR. <br />
<br />
[[File:ocln_2.jpg | center]]<br />
<div align="center">Figure 5: OCLN on Audio Commands.</div><br />
Wake word Detection: <br />
Finally, we evaluate DROCC–LF on the practical problem of wake word detection with low FPR against arbitrary OOD negatives. To this end, we identify a keyword, say “Marvin” from the audio commands dataset (Warden, 2018) [8] as the positive class, and the remaining 34 keywords are labeled as the negative class. For training, we sample points uniformly at random from the above-mentioned dataset. However, for evaluation, we sample positives from the train distribution, but negatives contain a few challenging OOD points as well. Sampling challenging negatives itself is a hard task and is the key motivating reason for studying the problem. So, we manually list close-by keywords to Marvin such as Mar, Vin, Marvelous etc. We then generate audio snippets for these keywords via a speech synthesis tool 2 with a variety of accents.<br />
Figure 3 shows that for 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines. For example, with FPR=3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword: Seven, and observed a similar trend. In summary, DROCC–LF is able to generalize well against negatives that are “close” to the true positives even when such negatives were not supplied with the training data.<br />
<br />
== Conclusion and Future Work ==<br />
We introduced DROCC method for deep anomaly detection. It models normal data points using a low-dimensional manifold and hence can compare close points via Euclidean distance. Based on this intuition, DROCC’s optimization is formulated as a saddle point problem which is solved via a standard gradient descent-ascent algorithm. We then extended DROCC to OCLN problem where the goal is to generalize well against arbitrary negatives, assuming the positive class is well sampled and a small number of negative points are also available. Both the methods perform significantly better than strong baselines, in their respective problem settings. <br />
<br />
For computational efficiency, we simplified the projection set of both the methods which can perhaps slow down the convergence of the two methods. Designing optimization algorithms that can work with more stricter set is an exciting research direction. Further, we would also like to rigorously analyze DROCC, assuming enough samples from a low-curvature manifold. Finally, as OCLN is an exciting problem that routinely comes up in a variety of real-world applications, we would like to apply DROCC–LF to a few high impact scenarios.<br />
<br />
The results of this study showed that DROCC is comparatively better for anomaly detection across many different areas, such as tabular data, images, audio, and time series, when compared to existing state-of-the-art techniques.<br />
<br />
It would be interesting to see how the DROCC method performs in situations where the anomaly is very rare, say detecting signals of volacanic explosion from seismic activity data. Such challenging anomalous situations will be a test of endurance for this method and can even help advance work in this area.<br />
<br />
== References ==<br />
[1]: Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.<br />
<br />
[2]: Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., M¨uller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
[3]: Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.<br />
<br />
[4]: Sch¨olkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.<br />
<br />
[5]: Tax, D. M. and Duin, R. P. Support vector data description. Machine Learning, 54(1), 2004.<br />
<br />
[6]: Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.<br />
<br />
[7]: Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.<br />
<br />
[8]: Warden, P. Speech commands: A dataset for limited vocabulary speech recognition, 2018. URL https: //arxiv.org/abs/1804.03209.<br />
<br />
[9]: Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.<br />
<br />
== Critiques/Insights ==<br />
<br />
1. It would be interesting to see this implemented in self-driving cars, for instance, to detect unusual road conditions.<br />
<br />
2. Figure 1 shows a good representation on how this model works. However, how can we know that this model is not prone to overfitting? There are many situations where there are valid points that lie outside of the line, especially new data that the model has never see before. An explanation on how this is avoided would be good.<br />
<br />
3.In the introduction part, it should first explain what is "one class", and then make a detailed application. Moreover, special definition words are used in many places in the text. No detailed explanation was given. In the end, the future application fields of DROCC and the research direction of the group can be explained.<br />
<br />
4. The geometry of this technique (classification based on incidence outside of a ball centred at a known point <math>x</math>) sounds quite similar to K-nearest neighbors. While the authors compared DROCC to single-nearest neighbor classification, choosing a higher K would result in a stronger, more regularized model. It would be interesting to see how DROCC compares to the general KNN classifier</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou&diff=46397User:J46hou2020-11-25T05:39:12Z<p>Y93fang: /* Previous Work */</p>
<hr />
<div>DROCC: Deep Robust One-Class Classification<br />
== Presented by == <br />
Jinjiang Lian, Yisheng Zhu, Jiawen Hou, Mingzhe Huang<br />
== Introduction ==<br />
In this work, we study “one-class” classification, where the goal is to obtain accurate discriminators for a special class. Popular uses of this technique include anomaly detection where we are interested in detecting outliers. Anomaly detection is a well-studied area of research, however, the conventional approach of modeling with typical data using a simple function falls short when it comes to complex domains such as vision or speech. Another case where this would be useful is when recognizing “wake-word” while waking up AI systems such as Alexa. <br />
<br />
Deep learning based on anomaly detection methods attempts to learn features automatically but has some limitations. One approach is based on extending the classical data modeling techniques over the learned representations, but in this case, all the points may be mapped to a single point which makes the layer looks "perfect". The second approach is based on learning the salient geometric structure of data and training the discriminator to predict applied transformation. The result could be considered anomalous if the discriminator fails to predict the transform accurately.<br />
<br />
Thus, in this paper, we are presenting a new approach called Deep Robust One-Class Classification (DROCC) to solve the above concerns. DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear low dimensional manifold. More specifically, we are presenting DROCC-LF which is an outlier-exposure style extension of DROCC. This extension combines the DROCC's anomaly detection loss with standard classification loss over the negative data.<br />
<br />
== Previous Work ==<br />
Traditional approaches for one-class problems include one-class SVM (Scholkopf et al., 1999) and Isolation Forest (Liu et al., 2008)[9]. One drawback of these approaches is that they involve careful feature engineering when applied to structured domains like images. The current state of art methodology to tackle this kind of problems are: <br />
<br />
1. Approach based on prediction transformations (Golan & El-Yaniv, 2018; Hendrycks et al.,2019a) [1] This approach has some shortcomings in the sense that it depends heavily on an appropriate domain-specific set of transformations that are in general hard to obtain. <br />
<br />
2. Approach of minimizing a classical one-class loss on the learned final layer representations such as DeepSVDD. (Ruff et al.,2018)[2] This method suffers from the fundamental drawback of representation collapse where the model is no longer being able to accurately recognize the feature representations.<br />
<br />
3. Approach based on balancing unbalanced training data sets using methods such as SMOTE to synthetically create outlier data to train models on.<br />
<br />
== Motivation ==<br />
Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009) [3]. The goal is to identify the outliers - the points not following a typical distribution. <br />
[[File:abnormal.jpeg | thumb | center | 1000px | Abnormal Data (Data Driven Investor, 2020)]]<br />
Classical approaches for anomaly detection are based on modeling the typical data using simple functions over the low-dimensional subspace or a tree-structured partition of the input space to detect anomalies (Sch¨olkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004) [4], such as constructing a minimum-enclosing ball around the typical data points (Tax & Duin, 2004) [5]. They broadly fall into three categories: AD via generative modeling, Deep Once Class SVM, Transformations based methods. While these techniques are well-suited when the input is featured appropriately, they struggle on complex domains like vision and speech, where hand-designing features are difficult.<br />
<br />
Another related problem is the one-class classification under limited negatives (OCLN). In this case, only a few negative samples are available. The goal is to find a classifier that would not misfire close negatives so that the false positive rate will be low. <br />
<br />
DROCC is robust to representation collapse by involving a discriminative component that is general and empirically accurate on most standard domains like tabular, time-series and vision without requiring any additional side information. DROCC is motivated by the key observation that generally, the typical data lies on a low-dimensional manifold, which is well-sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). [6]<br />
<br />
== Model Explanation ==<br />
[[File:drocc_f1.jpg | center]]<br />
<div align="center">Figure 1</div><br />
(a): A normal data manifold with red dots representing generated anomalous points in Ni(r). <br />
<br />
(b): Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. <br />
<br />
(c), (d): First two dimensions of the decision boundary of DROCC and DROCC–LF, when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal while DROCC’s decision boundary is inaccurate. Yellow color sine wave depicts the train data.<br />
<br />
== DROCC ==<br />
The model is based on the assumption that the true data lies on a manifold. As manifolds resemble Euclidean space locally, our discriminative component is based on classifying a point as anomalous if it is outside the union of small L2 norm balls around the training typical points (See Figure 1a, 1b for an illustration). Importantly, the above definition allows us to synthetically generate anomalous points, and we adaptively generate the most effective anomalous points while training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase to adaptively add anomalous points to our training set and a gradient descent phase to minimize the classification loss by learning a representation and a classifier on top of the representations to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to a representation collapse as mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.<br />
<br />
== DROCC-LF ==<br />
To especially tackle problems such as anomaly detection and outlier exposure (Hendrycks et al., 2019a) [7], we propose DROCC–LF, an outlier-exposure style extension of DROCC. Intuitively, DROCC–LF combines DROCC’s anomaly detection loss (that is over only the positive data points) with standard classification loss over the negative data. In addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance to compare points over the manifold instead of using the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively fewer samples. (See Figure 1c, 1d for illustration)<br />
<br />
== Popular Dataset Benchmark Result ==<br />
<br />
[[File:drocc_auc.jpg | center]]<br />
<div align="center">Figure 2: AUC result</div><br />
<br />
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. The average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10 is shown in table 1. DROCC outperforms baselines on most classes, with gains as high at 20%, and notably, nearest neighbors (NN) beats all the baselines on 2 classes.<br />
<br />
[[File:drocc_f1score.jpg | center]]<br />
<div align="center">Figure 3: F1-Score</div><br />
<br />
Table 2 shows F1-Score (with standard deviation) for one-vs-all anomaly detection on Thyroid, Arrhythmia, and Abalone datasets from UCI Machine Learning Repository. DROCC outperforms the baselines on all three datasets by a minimum of 0.07 which is about 11.5% performance increase.<br />
Results on One-class Classification with Limited Negatives (OCLN): <br />
[[File:ocln.jpg | center]]<br />
<div align="center">Figure 4: Sample postives, negatives and close negatives for MNIST digit 0 vs 1 experiment (OCLN).</div><br />
MNIST 0 vs. 1 Classification: <br />
We consider an experimental setup on MNIST dataset, where the training data consists of Digit 0, the normal class, and Digit 1 as the anomaly. During the evaluation, in addition to samples from training distribution, we also have half zeros, which act as challenging OOD points (close negatives). These half zeros are generated by randomly masking 50% of the pixels (Figure 2). BCE performs poorly, with a recall of 54% only at a fixed FPR of 3%. DROCC–OE gives a recall value of 98:16% outperforming DeepSAD by a margin of 7%, which gives a recall value of 90:91%. DROCC–LF provides further improvement with a recall of 99:4% at 3% FPR. <br />
<br />
[[File:ocln_2.jpg | center]]<br />
<div align="center">Figure 5: OCLN on Audio Commands.</div><br />
Wake word Detection: <br />
Finally, we evaluate DROCC–LF on the practical problem of wake word detection with low FPR against arbitrary OOD negatives. To this end, we identify a keyword, say “Marvin” from the audio commands dataset (Warden, 2018) [8] as the positive class, and the remaining 34 keywords are labeled as the negative class. For training, we sample points uniformly at random from the above-mentioned dataset. However, for evaluation, we sample positives from the train distribution, but negatives contain a few challenging OOD points as well. Sampling challenging negatives itself is a hard task and is the key motivating reason for studying the problem. So, we manually list close-by keywords to Marvin such as Mar, Vin, Marvelous etc. We then generate audio snippets for these keywords via a speech synthesis tool 2 with a variety of accents.<br />
Figure 3 shows that for 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines. For example, with FPR=3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword: Seven, and observed a similar trend. In summary, DROCC–LF is able to generalize well against negatives that are “close” to the true positives even when such negatives were not supplied with the training data.<br />
<br />
== Conclusion and Future Work ==<br />
We introduced DROCC method for deep anomaly detection. It models normal data points using a low-dimensional manifold and hence can compare close points via Euclidean distance. Based on this intuition, DROCC’s optimization is formulated as a saddle point problem which is solved via a standard gradient descent-ascent algorithm. We then extended DROCC to OCLN problem where the goal is to generalize well against arbitrary negatives, assuming the positive class is well sampled and a small number of negative points are also available. Both the methods perform significantly better than strong baselines, in their respective problem settings. <br />
<br />
For computational efficiency, we simplified the projection set of both the methods which can perhaps slow down the convergence of the two methods. Designing optimization algorithms that can work with more stricter set is an exciting research direction. Further, we would also like to rigorously analyze DROCC, assuming enough samples from a low-curvature manifold. Finally, as OCLN is an exciting problem that routinely comes up in a variety of real-world applications, we would like to apply DROCC–LF to a few high impact scenarios.<br />
<br />
The results of this study showed that DROCC is comparatively better for anomaly detection across many different areas, such as tabular data, images, audio, and time series, when compared to existing state-of-the-art techniques.<br />
<br />
It would be interesting to see how the DROCC method performs in situations where the anomaly is very rare, say detecting signals of volacanic explosion from seismic activity data. Such challenging anomalous situations will be a test of endurance for this method and can even help advance work in this area.<br />
<br />
== References ==<br />
[1]: Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.<br />
<br />
[2]: Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., M¨uller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
[3]: Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.<br />
<br />
[4]: Sch¨olkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.<br />
<br />
[5]: Tax, D. M. and Duin, R. P. Support vector data description. Machine Learning, 54(1), 2004.<br />
<br />
[6]: Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.<br />
<br />
[7]: Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.<br />
<br />
[8]: Warden, P. Speech commands: A dataset for limited vocabulary speech recognition, 2018. URL https: //arxiv.org/abs/1804.03209.<br />
<br />
[9]: Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.<br />
<br />
== Critiques/Insights ==<br />
<br />
1. It would be interesting to see this implemented in self-driving cars, for instance, to detect unusual road conditions.<br />
<br />
2. Figure 1 shows a good representation on how this model works. However, how can we know that this model is not prone to overfitting? There are many situations where there are valid points that lie outside of the line, especially new data that the model has never see before. An explanation on how this is avoided would be good.<br />
<br />
3.In the introduction part, it should first explain what is "one class", and then make a detailed application. Moreover, special definition words are used in many places in the text. No detailed explanation was given. In the end, the future application fields of DROCC and the research direction of the group can be explained.<br />
<br />
4. The geometry of this technique (classification based on incidence outside of a ball centred at a known point <math>x</math>) sounds quite similar to K-nearest neighbors. While the authors compared DROCC to single-nearest neighbor classification, choosing a higher K would result in a stronger, more regularized model. It would be interesting to see how DROCC compares to the general KNN classifier</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou&diff=46380User:J46hou2020-11-25T05:07:08Z<p>Y93fang: /* Conclusion and Future Work */</p>
<hr />
<div>DROCC: Deep Robust One-Class Classification<br />
== Presented by == <br />
Jinjiang Lian, Yisheng Zhu, Jiawen Hou, Mingzhe Huang<br />
== Introduction ==<br />
In this work, we study “one-class” classification, where the goal is to obtain accurate discriminators for a special class. Popular uses of this technique include anomaly detection where we are interested in detecting outliers. Anomaly detection is a well-studied area of research, however, the conventional approach of modeling with typical data using a simple function falls short when it comes to complex domains such as vision or speech. Another case where this would be useful is when recognizing “wake-word” while waking up AI systems such as Alexa. <br />
<br />
Deep learning based on anomaly detection methods attempts to learn features automatically but has some limitations. One approach is based on extending the classical data modeling techniques over the learned representations, but in this case, all the points may be mapped to a single point which makes the layer looks "perfect". The second approach is based on learning the salient geometric structure of data and training the discriminator to predict applied transformation. The result could be considered anomalous if the discriminator fails to predict the transform accurately.<br />
<br />
Thus, in this paper, we are presenting a new approach called Deep Robust One-Class Classification (DROCC) to solve the above concerns. DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear low dimensional manifold. More specifically, we are presenting DROCC-LF which is an outlier-exposure style extension of DROCC. This extension combines the DROCC's anomaly detection loss with standard classification loss over the negative data.<br />
<br />
== Previous Work ==<br />
Traditional approaches for one-class problems include one-class SVM (Scholkopf et al., 1999) and isolation forest (Liu et al., 2008)[9]. One drawback of these approaches is that they involve careful feature engineering when applied to structured domains like images. The current state of art methodology to tackle this kind of problems are: <br />
<br />
1. Approach based on prediction transformations (Golan & El-Yaniv, 2018; Hendrycks et al.,2019a) [1] This approach has some shortcomings in the sense that it depends heavily on an appropriate domain-specific set of transformations that are in general hard to obtain. <br />
<br />
2. Approach of minimizing a classical one-class loss on the learned final layer representations such as DeepSVDD. (Ruff et al.,2018)[2] This method suffers from the fundamental drawback of representation collapse where the model is no longer being able to accurately recognize the feature representations.<br />
<br />
3. Approach based on balancing unbalanced training data sets using methods such as SMOTE to synthetically create outlier data to train models on.<br />
<br />
== Motivation ==<br />
Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009) [3]. The goal is to identify the outliers - the points not following a typical distribution. <br />
[[File:abnormal.jpeg | thumb | center | 1000px | Abnormal Data (Data Driven Investor, 2020)]]<br />
Classical approaches for anomaly detection are based on modeling the typical data using simple functions over the low-dimensional subspace or a tree-structured partition of the input space to detect anomalies (Sch¨olkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004) [4], such as constructing a minimum-enclosing ball around the typical data points (Tax & Duin, 2004) [5]. They broadly fall into three categories: AD via generative modeling, Deep Once Class SVM, Transformations based methods. While these techniques are well-suited when the input is featured appropriately, they struggle on complex domains like vision and speech, where hand-designing features are difficult.<br />
<br />
Another related problem is the one-class classification under limited negatives (OCLN). In this case, only a few negative samples are available. The goal is to find a classifier that would not misfire close negatives so that the false positive rate will be low. <br />
<br />
DROCC is robust to representation collapse by involving a discriminative component that is general and empirically accurate on most standard domains like tabular, time-series and vision without requiring any additional side information. DROCC is motivated by the key observation that generally, the typical data lies on a low-dimensional manifold, which is well-sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). [6]<br />
<br />
== Model Explanation ==<br />
[[File:drocc_f1.jpg | center]]<br />
<div align="center">Figure 1</div><br />
(a): A normal data manifold with red dots representing generated anomalous points in Ni(r). <br />
<br />
(b): Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. <br />
<br />
(c), (d): First two dimensions of the decision boundary of DROCC and DROCC–LF, when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal while DROCC’s decision boundary is inaccurate. Yellow color sine wave depicts the train data.<br />
<br />
== DROCC ==<br />
The model is based on the assumption that the true data lies on a manifold. As manifolds resemble Euclidean space locally, our discriminative component is based on classifying a point as anomalous if it is outside the union of small L2 norm balls around the training typical points (See Figure 1a, 1b for an illustration). Importantly, the above definition allows us to synthetically generate anomalous points, and we adaptively generate the most effective anomalous points while training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase to adaptively add anomalous points to our training set and a gradient descent phase to minimize the classification loss by learning a representation and a classifier on top of the representations to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to a representation collapse as mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.<br />
<br />
== DROCC-LF ==<br />
To especially tackle problems such as anomaly detection and outlier exposure (Hendrycks et al., 2019a) [7], we propose DROCC–LF, an outlier-exposure style extension of DROCC. Intuitively, DROCC–LF combines DROCC’s anomaly detection loss (that is over only the positive data points) with standard classification loss over the negative data. In addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance to compare points over the manifold instead of using the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively fewer samples. (See Figure 1c, 1d for illustration)<br />
<br />
== Popular Dataset Benchmark Result ==<br />
<br />
[[File:drocc_auc.jpg | center]]<br />
<div align="center">Figure 2: AUC result</div><br />
<br />
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. The average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10 is shown in table 1. DROCC outperforms baselines on most classes, with gains as high at 20%, and notably, nearest neighbors (NN) beats all the baselines on 2 classes.<br />
<br />
[[File:drocc_f1score.jpg | center]]<br />
<div align="center">Figure 3: F1-Score</div><br />
<br />
Table 2 shows F1-Score (with standard deviation) for one-vs-all anomaly detection on Thyroid, Arrhythmia, and Abalone datasets from UCI Machine Learning Repository. DROCC outperforms the baselines on all three datasets by a minimum of 0.07 which is about 11.5% performance increase.<br />
Results on One-class Classification with Limited Negatives (OCLN): <br />
[[File:ocln.jpg | center]]<br />
<div align="center">Figure 4: Sample postives, negatives and close negatives for MNIST digit 0 vs 1 experiment (OCLN).</div><br />
MNIST 0 vs. 1 Classification: <br />
We consider an experimental setup on MNIST dataset, where the training data consists of Digit 0, the normal class, and Digit 1 as the anomaly. During the evaluation, in addition to samples from training distribution, we also have half zeros, which act as challenging OOD points (close negatives). These half zeros are generated by randomly masking 50% of the pixels (Figure 2). BCE performs poorly, with a recall of 54% only at a fixed FPR of 3%. DROCC–OE gives a recall value of 98:16% outperforming DeepSAD by a margin of 7%, which gives a recall value of 90:91%. DROCC–LF provides further improvement with a recall of 99:4% at 3% FPR. <br />
<br />
[[File:ocln_2.jpg | center]]<br />
<div align="center">Figure 5: OCLN on Audio Commands.</div><br />
Wake word Detection: <br />
Finally, we evaluate DROCC–LF on the practical problem of wake word detection with low FPR against arbitrary OOD negatives. To this end, we identify a keyword, say “Marvin” from the audio commands dataset (Warden, 2018) [8] as the positive class, and the remaining 34 keywords are labeled as the negative class. For training, we sample points uniformly at random from the above-mentioned dataset. However, for evaluation, we sample positives from the train distribution, but negatives contain a few challenging OOD points as well. Sampling challenging negatives itself is a hard task and is the key motivating reason for studying the problem. So, we manually list close-by keywords to Marvin such as Mar, Vin, Marvelous etc. We then generate audio snippets for these keywords via a speech synthesis tool 2 with a variety of accents.<br />
Figure 3 shows that for 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines. For example, with FPR=3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword: Seven, and observed a similar trend. In summary, DROCC–LF is able to generalize well against negatives that are “close” to the true positives even when such negatives were not supplied with the training data.<br />
<br />
== Conclusion and Future Work ==<br />
We introduced DROCC method for deep anomaly detection. It models normal data points using a low-dimensional manifold and hence can compare close points via Euclidean distance. Based on this intuition, DROCC’s optimization is formulated as a saddle point problem which is solved via a standard gradient descent-ascent algorithm. We then extended DROCC to OCLN problem where the goal is to generalize well against arbitrary negatives, assuming the positive class is well sampled and a small number of negative points are also available. Both the methods perform significantly better than strong baselines, in their respective problem settings. <br />
<br />
For computational efficiency, we simplified the projection set of both the methods which can perhaps slow down the convergence of the two methods. Designing optimization algorithms that can work with more stricter set is an exciting research direction. Further, we would also like to rigorously analyze DROCC, assuming enough samples from a low-curvature manifold. Finally, as OCLN is an exciting problem that routinely comes up in a variety of real-world applications, we would like to apply DROCC–LF to a few high impact scenarios.<br />
<br />
The results of this study showed that DROCC is comparatively better for anomaly detection across many different areas, such as tabular data, images, audio, and time series, when compared to existing state-of-the-art techniques.<br />
<br />
It would be interesting to see how the DROCC method performs in situations where the anomaly is very rare, say detecting signals of volacanic explosion from seismic activity data. Such challenging anomalous situations will be a test of endurance for this method and can even help advance work in this area.<br />
<br />
== References ==<br />
[1]: Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.<br />
<br />
[2]: Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., M¨uller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
[3]: Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.<br />
<br />
[4]: Sch¨olkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.<br />
<br />
[5]: Tax, D. M. and Duin, R. P. Support vector data description. Machine Learning, 54(1), 2004.<br />
<br />
[6]: Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.<br />
<br />
[7]: Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.<br />
<br />
[8]: Warden, P. Speech commands: A dataset for limited vocabulary speech recognition, 2018. URL https: //arxiv.org/abs/1804.03209.<br />
<br />
[9]: Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.<br />
<br />
== Critiques/Insights ==<br />
<br />
1. It would be interesting to see this implemented in self-driving cars, for instance, to detect unusual road conditions.<br />
<br />
2. Figure 1 shows a good representation on how this model works. However, how can we know that this model is not prone to overfitting? There are many situations where there are valid points that lie outside of the line, especially new data that the model has never see before. An explanation on how this is avoided would be good.<br />
<br />
3.In the introduction part, it should first explain what is "one class", and then make a detailed application. Moreover, special definition words are used in many places in the text. No detailed explanation was given. In the end, the future application fields of DROCC and the research direction of the group can be explained.<br />
<br />
4. The geometry of this technique (classification based on incidence outside of a ball centred at a known point <math>x</math>) sounds quite similar to K-nearest neighbors. While the authors compared DROCC to single-nearest neighbor classification, choosing a higher K would result in a stronger, more regularized model. It would be interesting to see how DROCC compares to the general KNN classifier</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_ODEs&diff=46196Neural ODEs2020-11-24T04:50:46Z<p>Y93fang: /* Implementation */</p>
<hr />
<div>== Introduction ==<br />
Chen et al. propose a new class of neural networks called neural ordinary differential equations (ODEs) in their 2018 paper under the same title. Neural network models, such as residual or recurrent networks, can be generalized as a set of transformations through hidden states (a.k.a layers) <math>\mathbf{h}</math>, given by the equation <br />
<br />
<div style="text-align:center;"><math> \mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t,\theta_t) </math> (1) </div><br />
<br />
where <math>t \in \{0,...,T\}</math> and <math>\theta_t</math> corresponds to the set of parameters or weights in state <math>t</math>. It is important to note that it has been shown (Lu et al., 2017)(Haber<br />
and Ruthotto, 2017)(Ruthotto and Haber, 2018) that Equation 1 can be viewed as an Euler discretization. Given this Euler description, if the number of layers and step size between layers are taken to their limits, then Equation 1 can instead be described continuously in the form of the ODE, <br />
<br />
<div style="text-align:center;"><math> \frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t),t,\theta) </math> (2). </div><br />
<br />
Equation 2 now describes a network where the output layer <math>\mathbf{h}(T)</math> is generated by solving for the ODE at time <math>T</math>, given the initial value at <math>t=0</math>, where <math>\mathbf{h}(0)</math> is the input layer of the network. <br />
<br />
With a vast amount of theory and research in the field of solving ODEs numerically, there are a number of benefits to formulating the hidden state dynamics this way. One major advantage is that a continuous description of the network allows for the calculation of <math>f</math> at arbitrary intervals and locations. The authors provide an example in section five of how the neural ODE network outperforms the discretized version i.e. residual networks, by taking advantage of the continuity of <math>f</math>. A depiction of this distinction is shown in the figure below. <br />
<br />
<div style="text-align:center;"> [[File:NeuralODEs_Fig1.png|350px]] </div><br />
<br />
In section four the authors show that the single-unit bottleneck of normalizing flows can be overcome by constructing a new class of density models that incorporates the neural ODE network formulation.<br />
The next section on automatic differentiation will describe how utilizing ODE solvers allows for the calculation of gradients of the loss function without storing any of the hidden state information. This results in a very low memory requirement for neural ODE networks in comparison to traditional networks that rely on intermediate hidden state quantities for backpropagation.<br />
<br />
== Reverse-mode Automatic Differentiation of ODE Solutions ==<br />
Like most neural networks, optimizing the weight parameters <math>\theta</math> for a neural ODE network involves finding the gradient of a loss function with respect to those parameters. Differentiating in the forward direction is a simple task, however, this method is very computationally expensive and unstable, as it introduces additional numerical error. Instead, the authors suggest that the gradients can be calculated in the reverse-mode with the adjoint sensitivity method (Pontryagin et al., 1962). This "backpropagation" method solves an augmented version of the forward ODE problem but in reverse, which is something that all ODE solvers are capable of. Section 3 provides results showing that this method gives very desirable memory costs and numerical stability. <br />
<br />
The authors provide an example of the adjoint method by considering the minimization of the scalar-valued loss function <math>L</math>, which takes the solution of the ODE solver as its argument.<br />
<br />
<div style="text-align:center;">[[File:NeuralODEs_Eq1.png|700px]],</div> <br />
This minimization problem requires the calculation of <math>\frac{\partial L}{\partial \mathbf{z}(t_0)}</math> and <math>\frac{\partial L}{\partial \theta}</math>.<br />
<br />
The adjoint itself is defined as <math>\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{z}(t)}</math>, which describes the gradient of the loss with respect to the hidden state <math>\mathbf{z}(t)</math>. By taking the first derivative of the adjoint, another ODE arises in the form of,<br />
<br />
<div style="text-align:center;"><math>\frac{d \mathbf{a}(t)}{dt} = -\mathbf{a}(t)^T \frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \mathbf{z}}</math> (3).</div> <br />
<br />
Since the value <math>\mathbf{a}(t_0)</math> is required to minimize the loss, the ODE in equation 3 must be solved backwards in time from <math>\mathbf{a}(t_1)</math>. Solving this problem is dependent on the knowledge of the hidden state <math>\mathbf{z}(t)</math> for all <math>t</math>, which an neural ODE does not save on the forward pass. Luckily, both <math>\mathbf{a}(t)</math> and <math>\mathbf{z}(t)</math> can be calculated in reverse, at the same time, by setting up an augmented version of the dynamics and is shown in the final algorithm. Finally, the derivative <math>dL/d\theta</math> can be expressed in terms of the adjoint and the hidden state as, <br />
<br />
<div style="text-align:center;"><math> \frac{dL}{d\theta} -\int_{t_1}^{t_0} \mathbf{a}(t)^T\frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \theta}dt</math> (4).</div><br />
<br />
To obtain very inexpensive calculations of <math>\frac{\partial f}{\partial z}</math> and <math>\frac{\partial f}{\partial \theta}</math> in equation 3 and 4, automatic differentiation can be utilized. The authors present an algorithm to calculate the gradients of <math>L</math> and their dependent quantities with only one call to an ODE solver and is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Algorithm1.png|850px]]</div><br />
<br />
If the loss function has a stronger dependence on the hidden states for <math>t \neq t_0,t_1</math>, then Algorithm 1 can be modified to handle multiple calls to the ODESolve step since most ODE solvers have the capability to provide <math>z(t)</math> at arbitrary times. A visual depiction of this scenario is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODES Fig2.png|350px]]</div><br />
<br />
Please see the [https://arxiv.org/pdf/1806.07366.pdf#page=13 appendix] for extended versions of Algorithm 1 and detailed derivations of each equation in this section.<br />
<br />
== Replacing Residual Networks with ODEs for Supervised Learning ==<br />
Section three of the paper investigates an application of the reverse-mode differentiation described in section two, for the training of neural ODE networks on the MNIST digit data set. To solve for the forward pass in the neural ODE network, the following experiment used Adams method, which is an implicit ODE solver. Although it has a marked improvement over explicit ODE solvers in numerical accuracy, integrating backward through the network for backpropagation is still not preferred and the adjoint sensitivity method is used to perform efficient weight optimization. The network with this "backpropagation" technique is referred to as ODE-Net in this section. <br />
<br />
=== Implementation ===<br />
A residual network (ResNet), studied by He et al. (2016), with six standard residual blocks was used as a comparative model for this experiment. The competing model, ODE-net, replaces the residual blocks of the ResNet with the Adams solver. As a hybrid of the two models ResNet and ODE-net, a third network was created called RK-Net, which solves the weight optimization of the neural ODE network explicitly through backward Runge-Kutta integration. The following table shows the training and performance results of each network. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Table1.png|400px]]</div><br />
<br />
Note that <math>L</math> and <math>\tilde{L}</math> are the number of layers in ResNet and the number of function calls that the Adams method makes for the two ODE networks and are effectively analogous quantities. As shown in Table 1, both of the ODE networks achieve comparable performance to that of the ResNet with a notable decrease in memory cost for ODE-net.<br />
<br />
<br />
Another interesting component of ODE networks is the ability to control the tolerance in the ODE solver used and subsequently the numerical error in the solution. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Fig3.png|700px]]</div><br />
<br />
The tolerance of the ODE solver is represented by the colour bar in Figure 3 above and notice that a variety of effects arise from adjusting this parameter. Primarily, if one was to treat the tolerance as a hyperparameter of sorts, you could tune it such that you find a balance between accuracy (Figure 3a) and computational complexity (Figure 3b). Figure 3c also provides further evidence for the benefits of the adjoint method for the backward pass in ODE-nets since there is a nearly 1:0.5 ratio of forward to backward function calls. In the ResNet and RK-Net examples, this ratio is 1:1.<br />
<br />
Additionally, the authors loosely define the concept of depth in a neural ODE network by referring to Figure 3d. Here it's evident that as you continue to train an ODE network, the number of function evaluations the ODE solver performs increases. As previously mentioned, this quantity is comparable to the network depth of a discretized network. However, as the authors note, this result should be seen as the progression of the network's complexity over training epochs, which is something we expect to increase over time.<br />
<br />
== Continuous Normalizing Flows ==<br />
<br />
Section four tackles the implementation of continuous-depth Neural Networks, but to do so, in the first part of section four the authors discuss theoretically how to establish this kind of network through the use of normalizing flows. The authors use a change of variables method presented in other works (Rezende and Mohamed, 2015), (Dinh et al., 2014), to compute the change of a probability distribution if sample points are transformed through a bijective function, <math>f</math>.<br />
<br />
<div style="text-align:center;"><math>z_1=f(z_0) \Rightarrow \log(p(z_1))=\log(p(z_0))-\log|\det\frac{\partial f}{\partial z_0}|</math></div><br />
<br />
Where p(z) is the probability distribution of the samples and <math>det\frac{\partial f}{\partial z_0}</math> is the determinant of the Jacobian which has a cubic cost in the dimension of '''z''' or the number of hidden units in the network. The authors discovered however that transforming the discrete set of hidden layers in the normalizing flow network to continuous transformations simplifies the computations significantly, due primarily to the following theorem:<br />
<br />
'''''Theorem 1:''' (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability p(z(t)) dependent on time. Let dz/dt=f(z(t),t) be a differential equation describing a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the change in log probability also follows a differential equation:''<br />
<br />
<div style="text-align:center;"><math>\frac{\partial \log(p(z(t)))}{\partial t}=-tr\left(\frac{df}{dz(t)}\right)</math></div><br />
<br />
The biggest advantage to using this theorem is that the trace function is a linear function, so if the dynamics of the problem, f, is represented by a sum of functions, then so is the log density. This essentially means that you can now compute flow models with only a linear cost with respect to the number of hidden units, <math>M</math>. In standard normalizing flow models, the cost is <math>O(M^3)</math>, so they will generally fit many layers with a single hidden unit in each layer.<br />
<br />
Finally the authors use these realizations to construct Continuous Normalizing Flow networks (CNFs) by specifying the parameters of the flow as a function of ''t'', ie, <math>f(z(t),t)</math>. They also use a gating mechanism for each hidden unit, <math>\frac{dz}{dt}=\sum_n \sigma_n(t)f_n(z)</math> where <math>\sigma_n(t)\in (0,1)</math> is a separate neural network which learns when to apply each dynamic <math>f_n</math>.<br />
<br />
===Implementation===<br />
<br />
The authors construct two separate types of neural networks to compare against each other, the first is the standard planar Normalizing Flow network (NF) using 64 layers of single hidden units, and the second is their new CNF with 64 hidden units. The NF model is trained over 500,000 iterations using RMSprop, and the CNF network is trained over 10,000 iterations using Adam(algorithm for first-order gradient-based optimization of stochastic objective functions). The loss function is <math>KL(q(x)||p(x))</math> where <math>q(x)</math> is the flow model and <math>p(x)</math> is the target probability density.<br />
<br />
One of the biggest advantages when implementing CNF is that you can train the flow parameters just by performing maximum likelihood estimation on <math>\log(q(x))</math> given <math>p(x)</math>, where <math>q(x)</math> is found via the theorem above, and then reversing the CNF to generate random samples from <math>q(x)</math>. This reversal of the CNF is done with about the same cost of the forward pass which is not able to be done in an NF network. The following two figures demonstrate the ability of CNF to generate more expressive and accurate output data as compared to standard NF networks.<br />
<br />
<div style="text-align:center;"><br />
[[Image:CNFcomparisons.png]]<br />
<br />
[[Image:CNFtransitions.png]]<br />
</div><br />
<br />
Figure 4 shows clearly that the CNF structure exhibits significantly lower loss functions than NF. In figure 5 both networks were tasked with transforming a standard Gaussian distribution into a target distribution, not only was the CNF network more accurate on the two moons target, but also the steps it took along the way are much more intuitive than the output from NF.<br />
<br />
== A Generative Latent Function Time-Series Model ==<br />
<br />
One of the largest issues at play in terms of Neural ODE networks is the fact that in many instances, data points are either very sparsely distributed, or irregularly-sampled. The latent dynamics are discretized and the observations are in the bins of fixed duration. An example of this is medical records which are only updated when a patient visits a doctor or the hospital. To solve this issue the authors had to create a generative time-series model which would be able to fill in the gaps of missing data. The authors consider each time series as a latent trajectory stemming from the initial local state <math>z_{t_0 }</math> and determined from a global set of latent parameters. Given a set of observation times and initial state, the generative model constructs points via the following sample procedure:<br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_0}∼p(z_{t_0}) <br />
</math><br />
</div> <br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_1},z_{t_2},\dots,z_{t_N}={\rm ODESolve}(z_{t_0},f,θ_f,t_0,...,t_N)<br />
</math><br />
</div><br />
<br />
<div style="text-align:center;"><br />
each <br />
<math><br />
x_{t_i}∼p(x│z_{t_i},θ_x)<br />
</math><br />
</div><br />
<br />
<math>f</math> is a function which outputs the gradient <math>\frac{\partial z(t)}{\partial t}=f(z(t),θ_f)</math> which is parameterized via a neural net. In order to train this latent variable model, the authors had to first encode their given data and observation times using an RNN encoder, construct the new points using the trained parameters, then decode the points back into the original space. The following figure describes this process:<br />
<br />
<div style="text-align:center;"><br />
[[Image:EncodingFigure.png]]<br />
</div><br />
<br />
Another variable which could affect the latent state of a time-series model is how often an event actually occurs. The authors solved this by parameterizing the rate of events in terms of a Poisson process. They described the set of independent observation times in an interval <math>\left[t_{start},t_{end}\right]</math> as:<br />
<br />
<div style="text-align:center;"> <br />
<math><br />
{\rm log}(p(t_1,t_2,\dots,t_N ))=\sum_{i=1}^N{\rm log}(\lambda(z(t_i)))-\int_{t_{start}}^{t_{end}}λ(z(t))dt<br />
</math><br />
</div><br />
<br />
where <math>\lambda(*)</math> is parameterized via another neural network.<br />
<br />
===Implementation===<br />
<br />
To test the effectiveness of the Latent time-series ODE model (LODE), they fit the encoder with 25 hidden units, parametrize function f with a one-layer 20 hidden unit network, and the decoder as another neural network with 20 hidden units. They compare this against a standard recurrent neural net (RNN) with 25 hidden units trained to minimize Gaussian log-likelihood. The authors tested both of these network systems on a dataset of 2-dimensional spirals which either rotated clockwise or counter-clockwise and sampled the positions of each spiral at 100 equally spaced time steps. They can then simulate irregularly timed data by taking random amounts of points without replacement from each spiral. The next two figures show the outcome of these experiments:<br />
<br />
<div style="text-align:center;"><br />
[[Image:LODEtestresults.png]] [[Image:SpiralFigure.png|The blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model]]<br />
</div><br />
<br />
In the figure on the right the blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model. It is noted that the LODE performs significantly better than the standard RNN model, especially on smaller sets of data points.<br />
<br />
== Scope and Limitations ==<br />
<br />
Section 6 mainly discusses the scope and limitations of the paper. Firstly while “batching” the training data is a useful step in standard neural nets, and can still be applied here by combining the ODEs associated with each batch, the authors found that controlling the error, in this case, may increase the number of calculations required. In practice, however, the number of calculations did not increase significantly.<br />
<br />
So long as the model proposed in this paper uses finite weights and Lipschitz nonlinearities, then Picard’s existence theorem (Coddington and Levinson, 1955) applies, guaranteeing the solution to the IVP exists and is unique. This theorem holds for the model presented above when the network has finite weights and uses nonlinearities in the Lipshitz class. <br />
<br />
In controlling the amount of error in the model, the authors were only able to reduce tolerances to approximately <math>10^{-3}</math> and <math>10^{-5}</math> in classification and density estimation respectively without also degrading the computational performance.<br />
<br />
The authors believe that reconstructing state trajectories by running the dynamics backward can introduce extra numerical error. They address a possible solution to this problem by checkpointing certain time steps and storing intermediate values of z on the forward pass. Then while reconstructing, you do each part individually between checkpoints. The authors acknowledged that they informally checked the validity of this method since they don’t consider it a practical problem.<br />
<br />
There remain, however, areas where standard neural networks may perform better than Neural ODEs. Firstly, conventional nets can fit non-homeomorphic functions, for example, functions whose output has a smaller dimension that their input, or that change the topology of the input space. However, this could be handled by composing ODE nets with standard network layers. Another point is that conventional nets can be evaluated exactly with a fixed amount of computation, and are typically faster to train and do not require an error tolerance for a solver.<br />
<br />
== Conclusions and Critiques ==<br />
<br />
We covered the use of black-box ODE solvers as a model component and their application to initial value problems constructed from real applications. Neural ODE Networks show promising gains in computational cost without large sacrifices in accuracy when applied to certain problems. A drawback of some of these implementations is that the ODE Neural Networks are limited by the underlying distributions of the problems they are trying to solve (requirement of Lipschitz continuity, etc.). There are plenty of further advances to be made in this field as hundreds of years of ODE theory and literature is available, so this is currently an important area of research.<br />
<br />
ODEs indeed represent an important area of applied mathematics where neural networks can be used to solve them numerically. Perhaps, a parallel area of investigation can be PDEs (Partial Differential Equations). PDEs are also widely encountered in many areas of applied mathematics, physics, social sciences and many other fields. It will be interesting to see how neural networks can be used to solve PDEs.<br />
<br />
== More Critiques ==<br />
<br />
This paper covers the memory efficiency of Neural ODE Networks, but does not address runtime. In practice, most systems are bound by latency requirements more-so than memory requirements (except in edge device cases). Though it may be unreasonable to expect the authors to produce a performance-optimized implementation, it would be insightful to understand the computational bottlenecks so existing frameworks can take steps to address them. This model looks promising and practical performance is the key to enabling future research in this.<br />
<br />
The above critique also questions the need for a neural network for such a problem. This problem was studied by Brunel et al. and they presented their solution in their paper ''Parametric Estimation of Ordinary Differential Equations with Orthogonality Conditions''. While this solution also requires iteratively solving a complex optimization problem, they did not require the massive memory and runtime overhead of a neural network. For the neural network solution to demonstrate its potential, it should be including experimental comparisons with specialized ordinary differential equation algorithms instead of simply comparing with a general recurrent neural network.<br />
<br />
== References ==<br />
Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. ''arXiv preprint arXiv'':1710.10121, 2017.<br />
<br />
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. ''Inverse Problems'', 34 (1):014004, 2017.<br />
<br />
Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. ''arXiv preprint arXiv'':1804.04272, 2018.<br />
<br />
Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. ''The mathematical theory of optimal processes''. 1962.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ''European conference on computer vision'', pages 630–645. Springer, 2016b.<br />
<br />
Earl A Coddington and Norman Levinson. ''Theory of ordinary differential equations''. Tata McGrawHill Education, 1955.<br />
<br />
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. ''arXiv preprint arXiv:1505.05770'', 2015.<br />
<br />
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. ''arXiv preprint arXiv:1410.8516'', 2014.<br />
<br />
Brunel, N. J., Clairon, Q., & d’Alché-Buc, F. (2014). Parametric estimation of ordinary differential equations with orthogonality conditions. ''Journal of the American Statistical Association'', 109(505), 173-185.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_ODEs&diff=46195Neural ODEs2020-11-24T04:29:43Z<p>Y93fang: /* Implementation */</p>
<hr />
<div>== Introduction ==<br />
Chen et al. propose a new class of neural networks called neural ordinary differential equations (ODEs) in their 2018 paper under the same title. Neural network models, such as residual or recurrent networks, can be generalized as a set of transformations through hidden states (a.k.a layers) <math>\mathbf{h}</math>, given by the equation <br />
<br />
<div style="text-align:center;"><math> \mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t,\theta_t) </math> (1) </div><br />
<br />
where <math>t \in \{0,...,T\}</math> and <math>\theta_t</math> corresponds to the set of parameters or weights in state <math>t</math>. It is important to note that it has been shown (Lu et al., 2017)(Haber<br />
and Ruthotto, 2017)(Ruthotto and Haber, 2018) that Equation 1 can be viewed as an Euler discretization. Given this Euler description, if the number of layers and step size between layers are taken to their limits, then Equation 1 can instead be described continuously in the form of the ODE, <br />
<br />
<div style="text-align:center;"><math> \frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t),t,\theta) </math> (2). </div><br />
<br />
Equation 2 now describes a network where the output layer <math>\mathbf{h}(T)</math> is generated by solving for the ODE at time <math>T</math>, given the initial value at <math>t=0</math>, where <math>\mathbf{h}(0)</math> is the input layer of the network. <br />
<br />
With a vast amount of theory and research in the field of solving ODEs numerically, there are a number of benefits to formulating the hidden state dynamics this way. One major advantage is that a continuous description of the network allows for the calculation of <math>f</math> at arbitrary intervals and locations. The authors provide an example in section five of how the neural ODE network outperforms the discretized version i.e. residual networks, by taking advantage of the continuity of <math>f</math>. A depiction of this distinction is shown in the figure below. <br />
<br />
<div style="text-align:center;"> [[File:NeuralODEs_Fig1.png|350px]] </div><br />
<br />
In section four the authors show that the single-unit bottleneck of normalizing flows can be overcome by constructing a new class of density models that incorporates the neural ODE network formulation.<br />
The next section on automatic differentiation will describe how utilizing ODE solvers allows for the calculation of gradients of the loss function without storing any of the hidden state information. This results in a very low memory requirement for neural ODE networks in comparison to traditional networks that rely on intermediate hidden state quantities for backpropagation.<br />
<br />
== Reverse-mode Automatic Differentiation of ODE Solutions ==<br />
Like most neural networks, optimizing the weight parameters <math>\theta</math> for a neural ODE network involves finding the gradient of a loss function with respect to those parameters. Differentiating in the forward direction is a simple task, however, this method is very computationally expensive and unstable, as it introduces additional numerical error. Instead, the authors suggest that the gradients can be calculated in the reverse-mode with the adjoint sensitivity method (Pontryagin et al., 1962). This "backpropagation" method solves an augmented version of the forward ODE problem but in reverse, which is something that all ODE solvers are capable of. Section 3 provides results showing that this method gives very desirable memory costs and numerical stability. <br />
<br />
The authors provide an example of the adjoint method by considering the minimization of the scalar-valued loss function <math>L</math>, which takes the solution of the ODE solver as its argument.<br />
<br />
<div style="text-align:center;">[[File:NeuralODEs_Eq1.png|700px]],</div> <br />
This minimization problem requires the calculation of <math>\frac{\partial L}{\partial \mathbf{z}(t_0)}</math> and <math>\frac{\partial L}{\partial \theta}</math>.<br />
<br />
The adjoint itself is defined as <math>\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{z}(t)}</math>, which describes the gradient of the loss with respect to the hidden state <math>\mathbf{z}(t)</math>. By taking the first derivative of the adjoint, another ODE arises in the form of,<br />
<br />
<div style="text-align:center;"><math>\frac{d \mathbf{a}(t)}{dt} = -\mathbf{a}(t)^T \frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \mathbf{z}}</math> (3).</div> <br />
<br />
Since the value <math>\mathbf{a}(t_0)</math> is required to minimize the loss, the ODE in equation 3 must be solved backwards in time from <math>\mathbf{a}(t_1)</math>. Solving this problem is dependent on the knowledge of the hidden state <math>\mathbf{z}(t)</math> for all <math>t</math>, which an neural ODE does not save on the forward pass. Luckily, both <math>\mathbf{a}(t)</math> and <math>\mathbf{z}(t)</math> can be calculated in reverse, at the same time, by setting up an augmented version of the dynamics and is shown in the final algorithm. Finally, the derivative <math>dL/d\theta</math> can be expressed in terms of the adjoint and the hidden state as, <br />
<br />
<div style="text-align:center;"><math> \frac{dL}{d\theta} -\int_{t_1}^{t_0} \mathbf{a}(t)^T\frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \theta}dt</math> (4).</div><br />
<br />
To obtain very inexpensive calculations of <math>\frac{\partial f}{\partial z}</math> and <math>\frac{\partial f}{\partial \theta}</math> in equation 3 and 4, automatic differentiation can be utilized. The authors present an algorithm to calculate the gradients of <math>L</math> and their dependent quantities with only one call to an ODE solver and is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Algorithm1.png|850px]]</div><br />
<br />
If the loss function has a stronger dependence on the hidden states for <math>t \neq t_0,t_1</math>, then Algorithm 1 can be modified to handle multiple calls to the ODESolve step since most ODE solvers have the capability to provide <math>z(t)</math> at arbitrary times. A visual depiction of this scenario is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODES Fig2.png|350px]]</div><br />
<br />
Please see the [https://arxiv.org/pdf/1806.07366.pdf#page=13 appendix] for extended versions of Algorithm 1 and detailed derivations of each equation in this section.<br />
<br />
== Replacing Residual Networks with ODEs for Supervised Learning ==<br />
Section three of the paper investigates an application of the reverse-mode differentiation described in section two, for the training of neural ODE networks on the MNIST digit data set. To solve for the forward pass in the neural ODE network, the following experiment used Adams method, which is an implicit ODE solver. Although it has a marked improvement over explicit ODE solvers in numerical accuracy, integrating backward through the network for backpropagation is still not preferred and the adjoint sensitivity method is used to perform efficient weight optimization. The network with this "backpropagation" technique is referred to as ODE-Net in this section. <br />
<br />
=== Implementation ===<br />
A residual network (ResNet), studied by He et al. (2016), with six standard residual blocks was used as a comparative model for this experiment. The competing model, ODE-net, replaces the residual blocks of the ResNet with the Adams solver. As a hybrid of the two models ResNet and ODE-net, a third network was created called RK-Net, which solves the weight optimization of the neural ODE network explicitly through backward Runge-Kutta integration. The following table shows the training and performance results of each network. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Table1.png|400px]]</div><br />
<br />
Note that <math>L</math> and <math>\tilde{L}</math> are the number of layers in ResNet and the number of function calls that the Adams method makes for the two ODE networks and are effectively analogous quantities. As shown in Table 1, both of the ODE networks achieve comparable performance to that of the ResNet with a notable decrease in memory cost for ODE-net.<br />
<br />
<br />
Another interesting component of ODE networks is the ability to control the tolerance in the ODE solver used and subsequently the numerical error in the solution. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Fig3.png|700px]]</div><br />
<br />
The tolerance of the ODE solver is represented by the colour bar in Figure 3 above and notice that a variety of effects arise from adjusting this parameter. Primarily, if one was to treat the tolerance as a hyperparameter of sorts, you could tune it such that you find a balance between accuracy (Figure 3a) and computational complexity (Figure 3b). Figure 3c also provides further evidence for the benefits of the adjoint method for the backward pass in ODE-nets since there is a nearly 1:0.5 ratio of forward to backward function calls. In the ResNet and RK-Net examples, this ratio is 1:1.<br />
<br />
Additionally, the authors loosely define the concept of depth in a neural ODE network by referring to Figure 3d. Here it's evident that as you continue to train an ODE network, the number of function evaluations the ODE solver performs increases. As previously mentioned, this quantity is comparable to the network depth of a discretized network. However, as the authors note, this result should be seen as the progression of the network's complexity over training epochs, which is something we expect to increase over time.<br />
<br />
== Continuous Normalizing Flows ==<br />
<br />
Section four tackles the implementation of continuous-depth Neural Networks, but to do so, in the first part of section four the authors discuss theoretically how to establish this kind of network through the use of normalizing flows. The authors use a change of variables method presented in other works (Rezende and Mohamed, 2015), (Dinh et al., 2014), to compute the change of a probability distribution if sample points are transformed through a bijective function, <math>f</math>.<br />
<br />
<div style="text-align:center;"><math>z_1=f(z_0) \Rightarrow \log(p(z_1))=\log(p(z_0))-\log|\det\frac{\partial f}{\partial z_0}|</math></div><br />
<br />
Where p(z) is the probability distribution of the samples and <math>det\frac{\partial f}{\partial z_0}</math> is the determinant of the Jacobian which has a cubic cost in the dimension of '''z''' or the number of hidden units in the network. The authors discovered however that transforming the discrete set of hidden layers in the normalizing flow network to continuous transformations simplifies the computations significantly, due primarily to the following theorem:<br />
<br />
'''''Theorem 1:''' (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability p(z(t)) dependent on time. Let dz/dt=f(z(t),t) be a differential equation describing a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the change in log probability also follows a differential equation:''<br />
<br />
<div style="text-align:center;"><math>\frac{\partial \log(p(z(t)))}{\partial t}=-tr\left(\frac{df}{dz(t)}\right)</math></div><br />
<br />
The biggest advantage to using this theorem is that the trace function is a linear function, so if the dynamics of the problem, f, is represented by a sum of functions, then so is the log density. This essentially means that you can now compute flow models with only a linear cost with respect to the number of hidden units, <math>M</math>. In standard normalizing flow models, the cost is <math>O(M^3)</math>, so they will generally fit many layers with a single hidden unit in each layer.<br />
<br />
Finally the authors use these realizations to construct Continuous Normalizing Flow networks (CNFs) by specifying the parameters of the flow as a function of ''t'', ie, <math>f(z(t),t)</math>. They also use a gating mechanism for each hidden unit, <math>\frac{dz}{dt}=\sum_n \sigma_n(t)f_n(z)</math> where <math>\sigma_n(t)\in (0,1)</math> is a separate neural network which learns when to apply each dynamic <math>f_n</math>.<br />
<br />
===Implementation===<br />
<br />
The authors construct two separate types of neural networks to compare against each other, the first is the standard planar Normalizing Flow network (NF) using 64 layers of single hidden units, and the second is their new CNF with 64 hidden units. The NF model is trained over 500,000 iterations using RMSprop, and the CNF network is trained over 10,000 iterations using Adam. The loss function is <math>KL(q(x)||p(x))</math> where <math>q(x)</math> is the flow model and <math>p(x)</math> is the target probability density.<br />
<br />
One of the biggest advantages when implementing CNF is that you can train the flow parameters just by performing maximum likelihood estimation on <math>\log(q(x))</math> given <math>p(x)</math>, where <math>q(x)</math> is found via the theorem above, and then reversing the CNF to generate random samples from <math>q(x)</math>. This reversal of the CNF is done with about the same cost of the forward pass which is not able to be done in an NF network. The following two figures demonstrate the ability of CNF to generate more expressive and accurate output data as compared to standard NF networks.<br />
<br />
<div style="text-align:center;"><br />
[[Image:CNFcomparisons.png]]<br />
<br />
[[Image:CNFtransitions.png]]<br />
</div><br />
<br />
Figure 4 shows clearly that the CNF structure exhibits significantly lower loss functions than NF. In figure 5 both networks were tasked with transforming a standard Gaussian distribution into a target distribution, not only was the CNF network more accurate on the two moons target, but also the steps it took along the way are much more intuitive than the output from NF.<br />
<br />
== A Generative Latent Function Time-Series Model ==<br />
<br />
One of the largest issues at play in terms of Neural ODE networks is the fact that in many instances, data points are either very sparsely distributed, or irregularly-sampled. The latent dynamics are discretized and the observations are in the bins of fixed duration. An example of this is medical records which are only updated when a patient visits a doctor or the hospital. To solve this issue the authors had to create a generative time-series model which would be able to fill in the gaps of missing data. The authors consider each time series as a latent trajectory stemming from the initial local state <math>z_{t_0 }</math> and determined from a global set of latent parameters. Given a set of observation times and initial state, the generative model constructs points via the following sample procedure:<br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_0}∼p(z_{t_0}) <br />
</math><br />
</div> <br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_1},z_{t_2},\dots,z_{t_N}={\rm ODESolve}(z_{t_0},f,θ_f,t_0,...,t_N)<br />
</math><br />
</div><br />
<br />
<div style="text-align:center;"><br />
each <br />
<math><br />
x_{t_i}∼p(x│z_{t_i},θ_x)<br />
</math><br />
</div><br />
<br />
<math>f</math> is a function which outputs the gradient <math>\frac{\partial z(t)}{\partial t}=f(z(t),θ_f)</math> which is parameterized via a neural net. In order to train this latent variable model, the authors had to first encode their given data and observation times using an RNN encoder, construct the new points using the trained parameters, then decode the points back into the original space. The following figure describes this process:<br />
<br />
<div style="text-align:center;"><br />
[[Image:EncodingFigure.png]]<br />
</div><br />
<br />
Another variable which could affect the latent state of a time-series model is how often an event actually occurs. The authors solved this by parameterizing the rate of events in terms of a Poisson process. They described the set of independent observation times in an interval <math>\left[t_{start},t_{end}\right]</math> as:<br />
<br />
<div style="text-align:center;"> <br />
<math><br />
{\rm log}(p(t_1,t_2,\dots,t_N ))=\sum_{i=1}^N{\rm log}(\lambda(z(t_i)))-\int_{t_{start}}^{t_{end}}λ(z(t))dt<br />
</math><br />
</div><br />
<br />
where <math>\lambda(*)</math> is parameterized via another neural network.<br />
<br />
===Implementation===<br />
<br />
To test the effectiveness of the Latent time-series ODE model (LODE), they fit the encoder with 25 hidden units, parametrize function f with a one-layer 20 hidden unit network, and the decoder as another neural network with 20 hidden units. They compare this against a standard recurrent neural net (RNN) with 25 hidden units trained to minimize Gaussian log-likelihood. The authors tested both of these network systems on a dataset of 2-dimensional spirals which either rotated clockwise or counter-clockwise and sampled the positions of each spiral at 100 equally spaced time steps. They can then simulate irregularly timed data by taking random amounts of points without replacement from each spiral. The next two figures show the outcome of these experiments:<br />
<br />
<div style="text-align:center;"><br />
[[Image:LODEtestresults.png]] [[Image:SpiralFigure.png|The blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model]]<br />
</div><br />
<br />
In the figure on the right the blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model. It is noted that the LODE performs significantly better than the standard RNN model, especially on smaller sets of data points.<br />
<br />
== Scope and Limitations ==<br />
<br />
Section 6 mainly discusses the scope and limitations of the paper. Firstly while “batching” the training data is a useful step in standard neural nets, and can still be applied here by combining the ODEs associated with each batch, the authors found that controlling the error, in this case, may increase the number of calculations required. In practice, however, the number of calculations did not increase significantly.<br />
<br />
So long as the model proposed in this paper uses finite weights and Lipschitz nonlinearities, then Picard’s existence theorem (Coddington and Levinson, 1955) applies, guaranteeing the solution to the IVP exists and is unique. This theorem holds for the model presented above when the network has finite weights and uses nonlinearities in the Lipshitz class. <br />
<br />
In controlling the amount of error in the model, the authors were only able to reduce tolerances to approximately <math>10^{-3}</math> and <math>10^{-5}</math> in classification and density estimation respectively without also degrading the computational performance.<br />
<br />
The authors believe that reconstructing state trajectories by running the dynamics backward can introduce extra numerical error. They address a possible solution to this problem by checkpointing certain time steps and storing intermediate values of z on the forward pass. Then while reconstructing, you do each part individually between checkpoints. The authors acknowledged that they informally checked the validity of this method since they don’t consider it a practical problem.<br />
<br />
There remain, however, areas where standard neural networks may perform better than Neural ODEs. Firstly, conventional nets can fit non-homeomorphic functions, for example, functions whose output has a smaller dimension that their input, or that change the topology of the input space. However, this could be handled by composing ODE nets with standard network layers. Another point is that conventional nets can be evaluated exactly with a fixed amount of computation, and are typically faster to train and do not require an error tolerance for a solver.<br />
<br />
== Conclusions and Critiques ==<br />
<br />
We covered the use of black-box ODE solvers as a model component and their application to initial value problems constructed from real applications. Neural ODE Networks show promising gains in computational cost without large sacrifices in accuracy when applied to certain problems. A drawback of some of these implementations is that the ODE Neural Networks are limited by the underlying distributions of the problems they are trying to solve (requirement of Lipschitz continuity, etc.). There are plenty of further advances to be made in this field as hundreds of years of ODE theory and literature is available, so this is currently an important area of research.<br />
<br />
ODEs indeed represent an important area of applied mathematics where neural networks can be used to solve them numerically. Perhaps, a parallel area of investigation can be PDEs (Partial Differential Equations). PDEs are also widely encountered in many areas of applied mathematics, physics, social sciences and many other fields. It will be interesting to see how neural networks can be used to solve PDEs.<br />
<br />
== More Critiques ==<br />
<br />
This paper covers the memory efficiency of Neural ODE Networks, but does not address runtime. In practice, most systems are bound by latency requirements more-so than memory requirements (except in edge device cases). Though it may be unreasonable to expect the authors to produce a performance-optimized implementation, it would be insightful to understand the computational bottlenecks so existing frameworks can take steps to address them. This model looks promising and practical performance is the key to enabling future research in this.<br />
<br />
The above critique also questions the need for a neural network for such a problem. This problem was studied by Brunel et al. and they presented their solution in their paper ''Parametric Estimation of Ordinary Differential Equations with Orthogonality Conditions''. While this solution also requires iteratively solving a complex optimization problem, they did not require the massive memory and runtime overhead of a neural network. For the neural network solution to demonstrate its potential, it should be including experimental comparisons with specialized ordinary differential equation algorithms instead of simply comparing with a general recurrent neural network.<br />
<br />
== References ==<br />
Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. ''arXiv preprint arXiv'':1710.10121, 2017.<br />
<br />
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. ''Inverse Problems'', 34 (1):014004, 2017.<br />
<br />
Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. ''arXiv preprint arXiv'':1804.04272, 2018.<br />
<br />
Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. ''The mathematical theory of optimal processes''. 1962.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ''European conference on computer vision'', pages 630–645. Springer, 2016b.<br />
<br />
Earl A Coddington and Norman Levinson. ''Theory of ordinary differential equations''. Tata McGrawHill Education, 1955.<br />
<br />
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. ''arXiv preprint arXiv:1505.05770'', 2015.<br />
<br />
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. ''arXiv preprint arXiv:1410.8516'', 2014.<br />
<br />
Brunel, N. J., Clairon, Q., & d’Alché-Buc, F. (2014). Parametric estimation of ordinary differential equations with orthogonality conditions. ''Journal of the American Statistical Association'', 109(505), 173-185.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Y93fang&diff=45026User:Y93fang2020-11-16T23:52:52Z<p>Y93fang: Y93fang moved page User:Y93fang to Summary for survey of neural networked-based cancer prediction models from microarray data</p>
<hr />
<div>#REDIRECT [[Summary for survey of neural networked-based cancer prediction models from microarray data]]</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=45025Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T23:52:52Z<p>Y93fang: Y93fang moved page User:Y93fang to Summary for survey of neural networked-based cancer prediction models from microarray data</p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting methods and cluster methods are used in cancer prediction. For filtering, the resulted features are used with statistical methods or machine learning classification and cluster tools such as decision trees, K Nearest Neighbor and Self Organizing Maps(SOM) as figure 2 indicates.[[File:filtering gane.png]]<br />
<br />
All neural network’s neurons work as feature detectors that learn the input’s features. For our categorization into filtering, predicting and clustering methods was based on the overall rule that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithm such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy; meanwhile, clustering methods are unsupervised, which group similar samples or genes together. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group to a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
<br />
$$ RMSE = \sqrt{ \frac{\sum{(I(x)-O(x))^2}}{n} } $$<br />
<br />
$$ Logloss = \sum{(I(x)log(O(x)) + (1 - I(x))log(1 - O(x)))} $$<br />
<br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders and variational autoencoders. The architecture of the networks varies in many parameters, such as depth and loss function. Each example of an autoencoder mentioned above has different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient decent optimization, Adam optimizer).<br />
<br />
The neural network filtering methods were used by different statistical methods and classifiers. The methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoot or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification, or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
<br />
The codeword is a binary string C’k of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to map the input to the codeword iteratively, whose objective function is minimized in each iteration.<br />
<br />
Such cancer classifiers were applied on identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as classifier are better than those of SVM, logistic regression, naïve Bayes, classification trees and KNN.<br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without backpropagation mechanism, is one of the traditional model-based techniques to be applied on gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment, or both. To solve the high dimensionality problem, there are two methods: clustering ensembles by running a single clustering algorithm for several times, each of which has different initialization or number of parameters; and projective clustering by only considering a subset of the original features.<br />
<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were not easy to be obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken gene-to-cluster assignment into consideration.<br />
<br />
Also, the paper provides double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.<br />
<br />
== Summary ==<br />
<br />
Cancer is a disease with a very high fatality rate that spreads worldwide, and it’s essential to analyze gene expression for discovering genes abnormalities and increasing survivability as a consequence. The previous analysis in the paper reveals that neural networks are essentially used for filtering the gene expressions, predicting their class, or clustering them.<br />
<br />
Neural network filtering methods are used to reduce dimensionality of the gene expressions and remove their noise. In the article, the authors recommended deep architectures more than shallow architecture for best practise as they combine many nonlinearities. <br />
<br />
Neural network prediction methods can be used for both binary and multi-class problems. In binary cases, the network architecture has only one or two output neurons which diagnose a given sample as cancerous or non-cancerous, while the number of the output neurons in multi-class problems is equal to the number of classes. The authors suggested that the deep architecture with convolution layers which was the most recently used model, proved efficient capability and in predicting cancer subtypes as it captures the spatial correlations between gene expressions.<br />
Clustering is another analysis tool that is used to divide the gene expressions into groups. The authors indicated that a hybrid approach combining both the ensembling clustering and projective clustering is more accurate than using single-point clustering algorithm such as SOM since those methods do not have the capability to distinguish the noisy genes.<br />
<br />
==Discussion==<br />
There are some technical problems that can be considered and improved for building new models. <br><br />
<br />
1. Overfitting: Since gene expression datasets are high dimensional and have a relatively small number of samples, it would be likely to properly fits the training data but not accurate for test samples due to the lack of generalization capability. The ways to avoid overfitting can be: (1). adding weight penalties using regularization; (2). using the average predictions from many models trained on different datasets; (3). dropout. <br><br />
<br />
2. Model configuration and training: In order to reduce both the computational and memory expenses but also with high prediction accuracy, it’s crucial to properly set the network parameters. The possible ways can be: (1). proper initialization; (2). pruning the unimportant connections by removing the zero-valued neurons; (3). using ensemble learning framework by training different models using different parameter settings or using different parts of the dataset for each base model; (4). Using SMOTE for dealing with class imbalance on the high dimensional level. <br><br />
<br />
3. Model evaluation: Braga-Neto and Dougherty in their research revealed that cross-validation displayed excessive variance and therefore it is unreliable for small size data. The bootstrap method proved more accurate predictability.<br><br />
<br />
4. Study producibility: A study needs to be reproducible to enhance research reliability so that others can replicate the results using the same algorithms data and methodology. <br />
<br />
==Conclusion==<br />
This paper reviewed the most recent neural network-based cancer prediction models and gene expression analysis tools. The analysis indicates that the neural network methods are able to serve as filters, predictors, and clustering methods, and also showed that the role of the neural network determines its general architecture. To give suggestions for future neural network-based approaches, the authors highlighted some critical points that have to be considered such as overfitting and class imbalance, and suggest choosing different network parameters or combining two or more of the presented approaches.<br />
<br />
==Reference==<br />
Daoud, M., & Mayo, M. (2019). A survey of neural network-based cancer prediction models from microarray data. Artificial Intelligence in Medicine, 97, 204–214.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44946Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T18:03:42Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting methods and cluster methods are used in cancer prediction. For filtering, the resulted features are used with statistical methods or machine learning classification and cluster tools such as decision trees, K Nearest Neighbor and Self Organizing Maps(SOM) as figure 2 indicates.[[File:filtering gane.png]]<br />
<br />
All neural network’s neurons work as feature detectors that learn the input’s features. For our categorization into filtering, predicting and clustering methods was based on the overall rule that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithm such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy. <br />
Clustering methods are unsupervised, which group similar samples or genes. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group to a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders and variational autoencoders. The architecture of the networks varies in depth, loss functions and other parameters. Each example of an autoencoder mentioned above has different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient decent optimization, Adam optimizer).<br />
The neural network filtering methods were used by different statistical methods and classifiers. The methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoot or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification, or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
The codeword is a binary string C’k of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to iteratively learn the mapping between the input and the codeword by minimizing the objective function, like the Mean Squared Error (MSE).<br />
Such cancer classifiers were applied on identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as classifier are better than those of SVM, logistic regression, naïve Bayes, classification trees and KNN. <br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without backpropagation mechanism, is one of the traditional model-based techniques to be applied on gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment, or both. To solve the high dimensionality problem, there are two methods: clustering ensembles by repeatedly running a single clustering algorithm with different initializations or numbers of parameters; and projective clustering by only considering a subset of the original features.<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were hardly obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken gene-to-cluster assignment into consideration. Also, the paper provides double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.<br />
<br />
== Summary ==<br />
<br />
Cancer is a disease with a very high fatality rate that spreads worldwide, and it’s essential to analyze gene expression for discovering genes abnormalities and increasing survivability as a consequence. The previous analysis in the paper reveals that neural networks are essentially used for filtering the gene expressions, predicting their class, or clustering them.<br />
<br />
Neural network filtering methods are used to reduce dimensionality of the gene expressions and remove their noise. In the article, the authors recommended deep architectures more than shallow architecture for best practise as they combine many nonlinearities. <br />
<br />
Neural network prediction methods can be used for both binary and multi-class problems. In binary cases, the network architecture has only one or two output neurons which diagnose a given sample as cancerous or non-cancerous, while the number of the output neurons in multi-class problems is equal to the number of classes. The authors suggested that the deep architecture with convolution layers which was the most recently used model, proved efficient capability and in predicting cancer subtypes as it captures the spatial correlations between gene expressions.<br />
Clustering is another analysis tool that is used to divide the gene expressions into groups. The authors indicated that a hybrid approach combining both the ensembling clustering and projective clustering is more accurate than using single-point clustering algorithm such as SOM since those methods do not have the capability to distinguish the noisy genes.<br />
<br />
==Discussion==<br />
There are some technical problems that can be considered and improved for building new models. <br><br />
<br />
1. Overfitting: Since gene expression datasets are high dimensional and have a relatively small number of samples, it would be likely to properly fits the training data but not accurate for test samples due to the lack of generalization capability. The ways to avoid overfitting can be: (1). adding weight penalties using regularization; (2). using the average predictions from many models trained on different datasets; (3). dropout. <br><br />
<br />
2. Model configuration and training: In order to reduce both the computational and memory expenses but also with high prediction accuracy, it’s crucial to properly set the network parameters. The possible ways can be: (1). proper initialization; (2). pruning the unimportant connections by removing the zero-valued neurons; (3). using ensemble learning framework by training different models using different parameter settings or using different parts of the dataset for each base model; (4). Using SMOTE for dealing with class imbalance on the high dimensional level. <br><br />
<br />
3. Model evaluation: Braga-Neto and Dougherty in their research revealed that cross-validation displayed excessive variance and therefore it is unreliable for small size data. The bootstrap method proved more accurate predictability.<br><br />
<br />
4. Study producibility: A study needs to be reproducible to enhance research reliability so that others can replicate the results using the same algorithms data and methodology. <br />
<br />
==Conclusion==<br />
This paper reviewed the most recent neural network-based cancer prediction models and gene expression analysis tools. The analysis indicates that the neural network methods are able to serve as filters, predictors, and clustering methods, and also showed that the role of the neural network determines its general architecture. To give suggestions for future neural network-based approaches, the authors highlighted some critical points that have to be considered such as overfitting and class imbalance, and suggest choosing different network parameters or combining two or more of the presented approaches.<br />
<br />
==Reference==<br />
Daoud, M., & Mayo, M. (2019). A survey of neural network-based cancer prediction models from microarray data. Artificial Intelligence in Medicine, 97, 204–214.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44945Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:58:19Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting methods and cluster methods are used in cancer prediction. For filtering, the resulted features are used with statistical methods or machine learning classification and cluster tools such as decision trees, K Nearest Neighbor and Self Organizing Maps(SOM) as figure 2 indicates.[[File:filtering gane.png]]<br />
<br />
All neural network’s neurons work as feature detectors that learn the input’s features. For our categorization into filtering, predicting and clustering methods was based on the overall rule that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithm such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy. <br />
Clustering methods are unsupervised, which group similar samples or genes. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group to a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders and variational autoencoders. The architecture of the networks varies in depth, loss functions and other parameters. Each example of an autoencoder mentioned above has different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient decent optimization, Adam optimizer).<br />
The neural network filtering methods were used by different statistical methods and classifiers. The methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoot or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification, or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
The codeword is a binary string C’k of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to iteratively learn the mapping between the input and the codeword by minimizing the objective function, like the Mean Squared Error (MSE).<br />
Such cancer classifiers were applied on identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as classifier are better than those of SVM, logistic regression, naïve Bayes, classification trees and KNN. <br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without backpropagation mechanism, is one of the traditional model-based techniques to be applied on gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment, or both. To solve the high dimensionality problem, there are two methods: clustering ensembles by repeatedly running a single clustering algorithm with different initializations or numbers of parameters; and projective clustering by only considering a subset of the original features.<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were hardly obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken gene-to-cluster assignment into consideration. Also, the paper provides double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.<br />
<br />
== Summary ==<br />
<br />
Cancer is a disease with a very high fatality rate that spreads worldwide, and it’s essential to analyze gene expression for discovering genes abnormalities and increasing survivability as a consequence. The previous analysis in the paper reveals that neural networks are essentially used for filtering the gene expressions, predicting their class, or clustering them.<br />
<br />
Neural network filtering methods are used to reduce dimensionality of the gene expressions and remove their noise. In the article, the authors recommended deep architectures more than shallow architecture for best practise as they combine many nonlinearities. <br />
<br />
Neural network prediction methods can be used for both binary and multi-class problems. In binary cases, the network architecture has only one or two output neurons which diagnose a given sample as cancerous or non-cancerous, while the number of the output neurons in multi-class problems is equal to the number of classes. The authors suggested that the deep architecture with convolution layers which was the most recently used model, proved efficient capability and in predicting cancer subtypes as it captures the spatial correlations between gene expressions.<br />
Clustering is another analysis tool that is used to divide the gene expressions into groups. The authors indicated that a hybrid approach combining both the ensembling clustering and projective clustering is more accurate than using single-point clustering algorithm such as SOM since those methods do not have the capability to distinguish the noisy genes.<br />
<br />
==Discussion==<br />
There are some technical problems that can be considered and improved for building new models. <br><br />
<br />
1. Overfitting: Since gene expression datasets are high dimensional and have a relatively small number of samples, it would be likely to properly fits the training data but not accurate for test samples due to the lack of generalization capability. The ways to avoid overfitting can be: (1). adding weight penalties using regularization; (2). using the average predictions from many models trained on different datasets; (3). dropout. <br><br />
<br />
2. Model configuration and training: In order to reduce both the computational and memory expenses but also with high prediction accuracy, it’s crucial to properly set the network parameters. The possible ways can be: (1). proper initialization; (2). pruning the unimportant connections by removing the zero-valued neurons; (3). using ensemble learning framework by training different models using different parameter settings or using different parts of the dataset for each base model; (4). Using SMOTE for dealing with class imbalance on the high dimensional level. <br><br />
<br />
3. Model evaluation: Braga-Neto and Dougherty in their research revealed that cross-validation displayed excessive variance and therefore it is unreliable for small size data. The bootstrap method proved more accurate predictability.<br><br />
<br />
4. Study producibility: A study needs to be reproducible to enhance research reliability so that others can replicate the results using the same algorithms data and methodology. <br />
<br />
==Conclusion==<br />
This paper reviewed the most recent neural network-based cancer prediction models and gene expression analysis tools. The analysis indicates that the neural network methods are able to serve as filters, predictors, and clustering methods, and also showed that the role of the neural network determines its general architecture. To give suggestions for future neural network-based approaches, the authors highlighted some critical points that have to be considered such as overfitting and class imbalance, and suggest choosing different network parameters or combining two or more of the presented approaches.<br />
<br />
==Reference==</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44944Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:52:45Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting methods and cluster methods are used in cancer prediction. For filtering, the resulted features are used with statistical methods or machine learning classification and cluster tools such as decision trees, K Nearest Neighbor and Self Organizing Maps(SOM) as figure 2 indicates.[[File:filtering gane.png]]<br />
<br />
All neural network’s neurons work as feature detectors that learn the input’s features. For our categorization into filtering, predicting and clustering methods was based on the overall rule that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithm such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy. <br />
Clustering methods are unsupervised, which group similar samples or genes. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group to a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders and variational autoencoders. The architecture of the networks varies in depth, loss functions and other parameters. Each example of an autoencoder mentioned above has different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient decent optimization, Adam optimizer).<br />
The neural network filtering methods were used by different statistical methods and classifiers. The methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoot or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification, or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
The codeword is a binary string C’k of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to iteratively learn the mapping between the input and the codeword by minimizing the objective function, like the Mean Squared Error (MSE).<br />
Such cancer classifiers were applied on identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as classifier are better than those of SVM, logistic regression, naïve Bayes, classification trees and KNN. <br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without backpropagation mechanism, is one of the traditional model-based techniques to be applied on gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment, or both. To solve the high dimensionality problem, there are two methods: clustering ensembles by repeatedly running a single clustering algorithm with different initializations or numbers of parameters; and projective clustering by only considering a subset of the original features.<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were hardly obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken gene-to-cluster assignment into consideration. Also, the paper provides double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.<br />
<br />
== Summary ==<br />
<br />
Cancer is a disease with a very high fatality rate that spreads worldwide, and it’s essential to analyze gene expression for discovering genes abnormalities and increasing survivability as a consequence. The previous analysis in the paper reveals that neural networks are essentially used for filtering the gene expressions, predicting their class, or clustering them.<br />
<br />
Neural network filtering methods are used to reduce dimensionality of the gene expressions and remove their noise. In the article, the authors recommended deep architectures more than shallow architecture for best practise as they combine many nonlinearities. <br />
<br />
Neural network prediction methods can be used for both binary and multi-class problems. In binary cases, the network architecture has only one or two output neurons which diagnose a given sample as cancerous or non-cancerous, while the number of the output neurons in multi-class problems is equal to the number of classes. The authors suggested that the deep architecture with convolution layers which was the most recently used model, proved efficient capability and in predicting cancer subtypes as it captures the spatial correlations between gene expressions.<br />
Clustering is another analysis tool that is used to divide the gene expressions into groups. The authors indicated that a hybrid approach combining both the ensembling clustering and projective clustering is more accurate than using single-point clustering algorithm such as SOM since those methods do not have the capability to distinguish the noisy genes.<br />
<br />
==Discussion==<br />
There are some technical problems that can be considered and improved for building new models. <br><br />
<br />
1. Overfitting: Since gene expression datasets are high dimensional and have a relatively small number of samples, it would be likely to properly fits the training data but not accurate for test samples due to the lack of generalization capability. The ways to avoid overfitting can be: (1). adding weight penalties using regularization; (2). using the average predictions from many models trained on different datasets; (3). dropout. <br><br />
<br />
2. Model configuration and training: In order to reduce both the computational and memory expenses but also with high prediction accuracy, it’s crucial to properly set the network parameters. The possible ways can be: (1). proper initialization; (2). pruning the unimportant connections by removing the zero-valued neurons; (3). using ensemble learning framework by training different models using different parameter settings or using different parts of the dataset for each base model; (4). Using SMOTE for dealing with class imbalance on the high dimensional level. <br><br />
<br />
3. Model evaluation: Braga-Neto and Dougherty in their research revealed that cross-validation displayed excessive variance and therefore it is unreliable for small size data. The bootstrap method proved more accurate predictability.<br><br />
<br />
4. Study producibility: A study needs to be reproducible to enhance research reliability so that others can replicate the results using the same algorithms data and methodology. <br />
<br />
==Conclusion==<br />
This paper reviewed the most recent neural network-based cancer prediction models and gene expression analysis tools. The analysis indicates that the neural network methods are able to serve as filters, predictors, and clustering methods, and also showed that the role of the neural network determines its general architecture. To give suggestions for future neural network-based approaches, the authors highlighted some critical points that have to be considered such as overfitting and class imbalance, and suggest choosing different network parameters or combining two or more of the presented approaches.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44943Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:50:14Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting methods and cluster methods are used in cancer prediction. For filtering, the resulted features are used with statistical methods or machine learning classification and cluster tools such as decision trees, K Nearest Neighbor and Self Organizing Maps(SOM) as figure 2 indicates.[[File:filtering gane.png]]<br />
<br />
All neural network’s neurons work as feature detectors that learn the input’s features. For our categorization into filtering, predicting and clustering methods was based on the overall rule that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithm such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy. <br />
Clustering methods are unsupervised, which group similar samples or genes. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group to a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders and variational autoencoders. The architecture of the networks varies in depth, loss functions and other parameters. Each example of an autoencoder mentioned above has different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient decent optimization, Adam optimizer).<br />
The neural network filtering methods were used by different statistical methods and classifiers. The methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoot or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification, or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
The codeword is a binary string C’k of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to iteratively learn the mapping between the input and the codeword by minimizing the objective function, like the Mean Squared Error (MSE).<br />
Such cancer classifiers were applied on identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as classifier are better than those of SVM, logistic regression, naïve Bayes, classification trees and KNN. <br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without backpropagation mechanism, is one of the traditional model-based techniques to be applied on gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment, or both. To solve the high dimensionality problem, there are two methods: clustering ensembles by repeatedly running a single clustering algorithm with different initializations or numbers of parameters; and projective clustering by only considering a subset of the original features.<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were hardly obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken gene-to-cluster assignment into consideration. Also, the paper provides double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.<br />
<br />
== Summary ==<br />
<br />
Cancer is a disease with a very high fatality rate that spreads worldwide, and it’s essential to analyze gene expression for discovering genes abnormalities and increasing survivability as a consequence. The previous analysis in the paper reveals that neural networks are essentially used for filtering the gene expressions, predicting their class, or clustering them.<br />
<br />
Neural network filtering methods are used to reduce dimensionality of the gene expressions and remove their noise. In the article, the authors recommended deep architectures more than shallow architecture for best practise as they combine many nonlinearities. <br />
<br />
Neural network prediction methods can be used for both binary and multi-class problems. In binary cases, the network architecture has only one or two output neurons which diagnose a given sample as cancerous or non-cancerous, while the number of the output neurons in multi-class problems is equal to the number of classes. The authors suggested that the deep architecture with convolution layers which was the most recently used model, proved efficient capability and in predicting cancer subtypes as it captures the spatial correlations between gene expressions.<br />
Clustering is another analysis tool that is used to divide the gene expressions into groups. The authors indicated that a hybrid approach combining both the ensembling clustering and projective clustering is more accurate than using single-point clustering algorithm such as SOM since those methods do not have the capability to distinguish the noisy genes.<br />
<br />
==Discussion==<br />
There are some technical problems that can be considered and improved for building new models.<br />
<br />
1. Overfitting: Since gene expression datasets are high dimensional and have a relatively small number of samples, it would be likely to properly fits the training data but not accurate for test samples due to the lack of generalization capability. The ways to avoid overfitting can be: (1). adding weight penalties using regularization; (2). using the average predictions from many models trained on different datasets; (3). dropout.<br />
2. Model configuration and training: In order to reduce both the computational and memory expenses but also with high prediction accuracy, it’s crucial to properly set the network parameters. The possible ways can be: (1). proper initialization; (2). pruning the unimportant connections by removing the zero-valued neurons; (3). using ensemble learning framework by training different models using different parameter settings or using different parts of the dataset for each base model; (4). Using SMOTE for dealing with class imbalance on the high dimensional level. <br />
3. Model evaluation: Braga-Neto and Dougherty in their research revealed that cross-validation displayed excessive variance and therefore it is unreliable for small size data. The bootstrap method proved more accurate predictability.<br />
4. Study producibility: A study needs to be reproducible to enhance research reliability so that others can replicate the results using the same algorithms data and methodology. <br />
<br />
==Conclusion==<br />
This paper reviewed the most recent neural network-based cancer prediction models and gene expression analysis tools. The analysis indicates that the neural network methods are able to serve as filters, predictors, and clustering methods, and also showed that the role of the neural network determines its general architecture. To give suggestions for future neural network-based approaches, the authors highlighted some critical points that have to be considered such as overfitting and class imbalance, and suggest choosing different network parameters or combining two or more of the presented approaches.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44942Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:47:05Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting methods and cluster methods are used in cancer prediction. For filtering, the resulted features are used with statistical methods or machine learning classification and cluster tools such as decision trees, K Nearest Neighbor and Self Organizing Maps(SOM) as figure 2 indicates.[[File:filtering gane.png]]<br />
<br />
All neural network’s neurons work as feature detectors that learn the input’s features. For our categorization into filtering, predicting and clustering methods was based on the overall rule that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithm such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy. <br />
Clustering methods are unsupervised, which group similar samples or genes. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group to a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders and variational autoencoders. The architecture of the networks varies in depth, loss functions and other parameters. Each example of an autoencoder mentioned above has different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient decent optimization, Adam optimizer).<br />
The neural network filtering methods were used by different statistical methods and classifiers. The methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoot or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification, or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
The codeword is a binary string C’k of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to iteratively learn the mapping between the input and the codeword by minimizing the objective function, like the Mean Squared Error (MSE).<br />
Such cancer classifiers were applied on identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as classifier are better than those of SVM, logistic regression, naïve Bayes, classification trees and KNN. <br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without backpropagation mechanism, is one of the traditional model-based techniques to be applied on gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment, or both. To solve the high dimensionality problem, there are two methods: clustering ensembles by repeatedly running a single clustering algorithm with different initializations or numbers of parameters; and projective clustering by only considering a subset of the original features.<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were hardly obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken gene-to-cluster assignment into consideration. Also, the paper provides double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44941Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:46:12Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting methods and cluster methods are used in cancer prediction. For filtering, the resulted features are used with statistical methods or machine learning classification and cluster tools such as decision trees, K Nearest Neighbor and Self Organizing Maps(SOM) as figure 2 indicates.[[File:filter and cluster.png]]<br />
<br />
All neural network’s neurons work as feature detectors that learn the input’s features. For our categorization into filtering, predicting and clustering methods was based on the overall rule that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithm such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy. <br />
Clustering methods are unsupervised, which group similar samples or genes. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group to a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders and variational autoencoders. The architecture of the networks varies in depth, loss functions and other parameters. Each example of an autoencoder mentioned above has different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient decent optimization, Adam optimizer).<br />
The neural network filtering methods were used by different statistical methods and classifiers. The methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoot or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification, or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
The codeword is a binary string C’k of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to iteratively learn the mapping between the input and the codeword by minimizing the objective function, like the Mean Squared Error (MSE).<br />
Such cancer classifiers were applied on identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as classifier are better than those of SVM, logistic regression, naïve Bayes, classification trees and KNN. <br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without backpropagation mechanism, is one of the traditional model-based techniques to be applied on gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment, or both. To solve the high dimensionality problem, there are two methods: clustering ensembles by repeatedly running a single clustering algorithm with different initializations or numbers of parameters; and projective clustering by only considering a subset of the original features.<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were hardly obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken gene-to-cluster assignment into consideration. Also, the paper provides double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:filter_and_cluster.png&diff=44940File:filter and cluster.png2020-11-16T17:45:31Z<p>Y93fang: filter and cluster gene</p>
<hr />
<div>filter and cluster gene</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44939Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:45:05Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting methods and cluster methods are used in cancer prediction. For filtering, the resulted features are used with statistical methods or machine learning classification and cluster tools such as decision trees, K Nearest Neighbor and Self Organizing Maps(SOM) as figure 2 indicates.[[File:filtering_gane]]<br />
<br />
All neural network’s neurons work as feature detectors that learn the input’s features. For our categorization into filtering, predicting and clustering methods was based on the overall rule that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithm such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy. <br />
Clustering methods are unsupervised, which group similar samples or genes. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group to a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders and variational autoencoders. The architecture of the networks varies in depth, loss functions and other parameters. Each example of an autoencoder mentioned above has different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient decent optimization, Adam optimizer).<br />
The neural network filtering methods were used by different statistical methods and classifiers. The methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoot or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification, or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
The codeword is a binary string C’k of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to iteratively learn the mapping between the input and the codeword by minimizing the objective function, like the Mean Squared Error (MSE).<br />
Such cancer classifiers were applied on identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as classifier are better than those of SVM, logistic regression, naïve Bayes, classification trees and KNN. <br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without backpropagation mechanism, is one of the traditional model-based techniques to be applied on gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment, or both. To solve the high dimensionality problem, there are two methods: clustering ensembles by repeatedly running a single clustering algorithm with different initializations or numbers of parameters; and projective clustering by only considering a subset of the original features.<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were hardly obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken gene-to-cluster assignment into consideration. Also, the paper provides double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44938Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:43:56Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting methods and cluster methods are used in cancer prediction. For filtering, the resulted features are used with statistical methods or machine learning classification and cluster tools such as decision trees, K Nearest Neighbor and Self Organizing Maps(SOM) as figure 2 indicates.[[File:filtering_gane]]<br />
<br />
All neural network’s neurons work as feature detectors that learn the input’s features. For our categorization into filtering, predicting and clustering methods was based on the overall rule that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithm such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy. <br />
Clustering methods are unsupervised, which group similar samples or genes. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group to a new test set with unknown labels.<br />
<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders and variational autoencoders. The architecture of the networks varies in depth, loss functions and other parameters. Each example of an autoencoder mentioned above has different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient decent optimization, Adam optimizer).<br />
The neural network filtering methods were used by different statistical methods and classifiers. The methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoot or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification, or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
The codeword is a binary string C’k of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to iteratively learn the mapping between the input and the codeword by minimizing the objective function, like the Mean Squared Error (MSE).<br />
Such cancer classifiers were applied on identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as classifier are better than those of SVM, logistic regression, naïve Bayes, classification trees and KNN. <br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without backpropagation mechanism, is one of the traditional model-based techniques to be applied on gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment, or both. To solve the high dimensionality problem, there are two methods: clustering ensembles by repeatedly running a single clustering algorithm with different initializations or numbers of parameters; and projective clustering by only considering a subset of the original features.<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were hardly obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken gene-to-cluster assignment into consideration. Also, the paper provides double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44937Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:40:01Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting methods and cluster methods are used in cancer prediction. For filtering, the resulted features are used with statistical methods or machine learning classification and cluster tools such as decision trees, K Nearest Neighbor and Self Organizing Maps(SOM) as figure 2 indicates.[[File:filtering_gene]]<br />
<br />
All neural network’s neurons work as feature detectors that learn the input’s features. For our categorization into filtering, predicting and clustering methods was based on the overall rule that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based''' <br></div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44936Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:36:31Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting methods and cluster methods are used in cancer prediction. For filtering, the resulted features are used with statistical methods or machine learning classification and cluster tools such as decision trees, K Nearest Neighbor and Self Organizing Maps(SOM) as figure 2 indicates.[[File:filtering gene]]<br />
<br />
All neural network’s neurons work as feature detectors that learn the input’s features. For our categorization into filtering, predicting and clustering methods was based on the overall rule that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based''' <br></div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:filtering_gane.png&diff=44935File:filtering gane.png2020-11-16T17:33:27Z<p>Y93fang: </p>
<hr />
<div></div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44934Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:31:15Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting methods and cluster methods are used in cancer prediction. For filtering, the resulted features are used with statistical methods or machine learning classification and cluster tools such as decision trees, K Nearest Neighbor and Self Organizing Maps(SOM) as figure 2 indicates.[[File:Example.jpg]]<br />
<br />
All neural network’s neurons work as feature detectors that learn the input’s features. For our categorization into filtering, predicting and clustering methods was based on the overall rule that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Datasets_and_preprocessing.png&diff=44933File:Datasets and preprocessing.png2020-11-16T17:29:13Z<p>Y93fang: </p>
<hr />
<div></div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44932Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:27:58Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few of techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p value > 10^-05 to remove some unwanted technical variation and log2 transformations. Statistical methods, neural network, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the datasets features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcame the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Example.jpg]]</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44931Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:25:56Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.png]]</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:f1.png&diff=44930File:f1.png2020-11-16T17:24:34Z<p>Y93fang: Y93fang uploaded a new version of File:f1.png</p>
<hr />
<div></div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44929Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:21:09Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
<br />
== Neural network-based cancer prediction models ==<br />
By performing an extensive search relevant to neural network-based cancer prediction using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, the chosen papers covered cancer classification, discovery, survivability prediction and the statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive and clustering for chosen papers. [[File:f1.jpg]]</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44928Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:15:15Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.</div>Y93fanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=44927Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-16T17:14:29Z<p>Y93fang: </p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases since it can help researchers to detect genetic information rapidly. In the study of cancer, the researchers use this technology to compare normal and abnormal cancerous tissues so that they can understand better about the pathology of cancer. However, what might affect the accuracy and computation time of this cancer model is the high dimensionality of the gene expressions. To cope with this problem, we need to use the feature selection method or feature creation method. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions. <br />
<br />
== Background == <br />
<br />
'''Neural Network'''\n<br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function. The difference between the output of the neural network and the desired output is what we called error.<br />
Backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br />
High dimensionality and the spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
The first is called ROC curve. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is good one when its ROC is greater than 0.7. Another way to measure the performance of a model is to use CI, which explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. The third measurement method is using the Brier score. A brier score measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.</div>Y93fang