http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=X277lu&feedformat=atomstatwiki - User contributions [US]2024-03-29T15:13:11ZUser contributionsMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction&diff=49736Surround Vehicle Motion Prediction2020-12-07T04:26:49Z<p>X277lu: /* Introduction */</p>
<hr />
<div>DROCC: '''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory-based (LSTM) Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach to predict the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections was described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.<br />
<br />
== Previous Work ==<br />
The autonomous vehicle trajectory approaches previously used motion models like Constant Velocity and Constant Acceleration. These models are linear and are only able to handle straight motions. There are curvilinear models such as Constant Turn Rate and Velocity and Constant Turn Rate and Acceleration which handle rotations and more complex motions. Together with these models, Kalman Filter is used to predicting the vehicle trajectory. Kalman filtering is a common technique used in sensor fusion for state estimation that allows the vehicle's state to be predicted while taking into account the uncertainty associated with inputs and measurements. However, the performance of the Kalman Filter in predicting multi-step problems is not that good. Recurrent Neural Network performs significantly better than it. <br />
<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. <br />
<br />
Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Research results indicate that little research has been dedicated on predicting the trajectory of intersections. Research regarding intersections have previously concentrated only on maneuver-level predictions of activities such as crossing, stopping or making turns. Trajectory level prediction has also been mainly focused on structured environments that enforce classifiable behavior but has not been extensively researched in multi-lane turn intersections. This is important since different drivers have varied driving behaviors as they travel through intersections. <br />
Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.<br />
<br />
<center><br />
[[ File:intersection.png |300px]]<br />
</center><br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection. The following image gives a visual representation of the model.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Sensor Outputs ===<br />
<br />
The input of the target perceptions is from the output of the sensors. The data collected in this article uses 6 different sensors with feature fusion to detect traffic in the range up to 100m: 1) LiDAR system outputs: Relative position, heading, velocity, and box size in local coordinates; 2) Around-View Monitoring (AVM) and 3)GPS outputs: acquire lanes, road marker, global position; 4) Gateway engine outputs: precise global position in urban road environment; 5) Micro-Autobox II and 6) a MDPS are used to control and actuate the subject. All data are stored in an industrial PC.<br />
<br />
=== Data ===<br />
Multi-lane turn intersections are the target roads in this paper. The dataset was collected using a human driven Autonomous Vehicle(AV) that was equipped with sensors to track motion the vehicle's surroundings. In addition, the motion sensors they used a front camera, Around-View-Monitor and GPS to acquire the lanes, road markers and global position. The data was collected in the urban roads of Gwanak-gu, Seoul, South Korea. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement, which is the sequential previous motion. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
A RNN is an artificial neural network that is suitable for use with sequential data because it has recurrent connections on its hidden nodes and thus, can retain its state or memory while processing the next input or sequence of inputs. For this reason, RNNs can be used to analyze time-series data where the pattern of the data depends on the time flow. This is an impossible task for traditional artificial neural networks, which assume the inputs are independent of one another. RNNs can also contain feedback loops that allow activations to flow alternately in the loop. <br />
<br />
In line with traditional neural networks, RNNs still suffer from the problem of vanishing gradients. An LSTM avoids this by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading, and speed.<br />
<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit. <br />
==== Sequence length ==== <br />
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.<br />
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles. <br />
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\<br />
&||\mu(k+1|t) - \mu(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
Where <math>u_{min}</math>, <math>u_{max}</math>and S are the minimum/maximum control input and maximum slew rate of input respectively.<br />
<br />
Determine the position and speed boundary based on the predicted state:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.<br />
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms, a path-following model with <br />
constant velocity, a path-following model with traffic flow and a CTRV model.<br />
<br />
We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>, <br />
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center><br />
<br />
The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean, <br />
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped <br />
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers' <br />
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within <br />
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, <br />
the proposed algorithm can be precise and maintain safety simultaneously.<br />
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method mimics a human driver better, by simulating a human driver's decision-making process. <br />
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to <br />
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based <br />
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when <br />
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means <br />
that these cases took place sufficiently beyond the safety distance, and had little influence on determining <br />
the behaviour of the subject vehicle.<br />
<br />
<center>[[Image:Figure11_YanYu.png|500px|]]</center><br />
<br />
In order to compare the similarities between the results form the proposed algorithm and human driving decisions, <br />
this article introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math><br />
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm, <br />
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than the base <br />
algorithm. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm <br />
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed <br />
model is efficient and safe.<br />
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm. The evaluation results showed precise prediction accuracy. In addition, the time-to-recognize in-lane targets within the intersection improved significantly over the performance of the base algorithms. The proposed algorithm was compared with human driving data, and it showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unconstructed environments, such as multi-lane turn intersections.<br />
<br />
== Future works ==<br />
This paper has identified several venues for future research, which include:<br />
<br />
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.<br />
<br />
3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.<br />
<br />
This is an interesting topic to discuss. This is a major topic for some famous vehicle companies such as Tesla, which now already has a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.<br />
<br />
Autonomous driving is a very hot topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.<br />
<br />
There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.<br />
<br />
It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
It would be really helpful if some visualization or data summary can be provided to understand the content, such as the track of the car movement.<br />
<br />
The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.<br />
<br />
Understandably, a supervised learning problem should be evaluated on some test dataset. However, supervised learning techniques are inherently ill-suited for general planning problems. The test dataset was obtained from human driving data which is known to be extremely noisy as well as unpredictable when it comes to motion planning. It would be crucial to determine the successes of this paper based on the state-of-the-art reinforcement learning techniques.<br />
<br />
It would be better if the authors compared their method against other SOTA methods. Also one of the reasons motion planning is done using interpretable methods rather than black boxes (such as this model) is because it is hard to see where things go wrong and fix problems with the black box when they occur - this is something the authors should have also discussed.<br />
<br />
A future area of study is to combine other source of information such as signals from Lidar or car side cameras to make a better prediction model.<br />
<br />
It might be interesting and helpful to conduct some training and testing under different weather/environmental conditions, as it could provide more generalization to real-life driving scenarios. For example, foggy weather and evening (low light) conditions might affect the performance of sensors, and rainy weather might require a longer braking distance.<br />
<br />
This paper proposes an interesting, novel model prediction algorithm, using LSTM_RNN. However, since motion prediction in autonomous driving has great real-life impacts, I do believe that the evaluations of the algorithm should be more thorough. For example, more traditional motion planning algorithms such as multi-modal estimation and Kalman filters should be used as benchmarks. Moreover, the experiment results are based on Korean driving conditions only. Eastern and Western drivers can have very different driving patterns, so that should be addressed in the discussion section of the paper as well.<br />
<br />
The paper mentions that in the future, this research plans to learn the real life behaviour of automated vehicles. Seeing a possible improvement in road safety due to this research will be very interesting.<br />
<br />
This predictor is also possible to be applied in the traffic control system.<br />
<br />
This prediction model should consider various conditions that could happen in an intersection. However, normal prediction may not work when there is a traffic jam or in some crowded time periods like rush hours.<br />
<br />
It would be better that the author could provide more comparison between the LSTN-RNN algorithm and other traditional algorithm such as RNN or just LSTM.<br />
<br />
The paper has really good results for what they aimed to achieve. However for the future work it would also be nice to have various climates/weathers to be included in the Seoul dataset. I think it's also important to consider it as different climates/weather (such as snowy roads, or rain) would introduce more noisier data (camera's image processing) and the human drivers behaviour would change as well to adapt to the new environment.<br />
<br />
It would be good to have a future work section to discusses shortage of current algorithms and the possible improvement.<br />
<br />
The summary explains the whole process well, but is missing the small details among the steps. It would be better to explain concepts such as RNN, modelling procedure for first time users.<br />
<br />
This paper presents a nice method, but does not seem particularly well developed. I would have liked to see some more ablations on this particular choice of RNN, as there are more efficient variants such as GRU which show similar performance in other tasks while being more amenable to real-time inference. Furthermore, the multi-model aspect seems slightly ad-hoc, it would have been nice to see a more rigorous formulation similar to seen in some recent work by Zeng et al. from Uber ATG: https://arxiv.org/pdf/2008.06041.pdf.<br />
<br />
The data used for this paper contains driver information exclusively to the urban roads of Gwanak-gu Seoul, hence the data may contain an inherited bias as drivers around the rest of the country, let alone the rest of the world, will have different habits based on different environments. It would be interesting to see if this model can be applied to other cities around the world and exhibit similar results or would there be a need to tune it based off geographic location.<br />
<br />
Since the data is based on urban roads, It would be better to include the details on performance of the model on high traffic area vs low traffic urban area. It would also be interesting to see the performance of the model with many pedestrians.<br />
<br />
While it would be nice to read more on why the authors chose LSTM-RNN, the paper exhibits a potential way to improve autonomous vehicle performance. It would be interesting to see how an army of robots would behave when this paper's method is applied in robotics, since robots' motions also follow a trajectory.<br />
<br />
An interesting topic, but the paper and accompanying summary are missing some details that would improve understandability. With respect to the background of the topic, a more detailed explanation of trajectories in the case of driving would help to better motivate the research. The addition of benchmarks and comparisons to current industry standards (if they are published publicly) would help to contextualize the results of the LSTM-RNN. An area of further study is applying these techniques to different weather situations and driving patterns in different countries. How does this model perform in regions where driver's very loosely follow the laws of the road? Further, could the research be generalized for controlled and uncontrolled turn lanes, especially on roads with higher speed limits?<br />
<br />
In the past, LSTM is a very popular model in natural language. It is interesting to see how it is used for motion prediction. Although the result is not good enough, it is still a good start. In the future, more language models can be used to see how well they can perform to do a comparison such as BiLSTM, Transformer, and Bert.<br />
<br />
LSTM-RNNs are really incredibly useful for sequential learning. The RNN should be able to predict, not only when a vehicle in front of the motion sensors, but it's likelihood of staying in motion vs coming to a sudden stop. There is no details in this review on the running time of the algorithm. In a field where split seconds could be the difference between an accident and a brake, running time cannot be stressed enough. Whether or not it is truly capable of dealing with real time traffic cannot be judged without relevant data and results being shown.<br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6]Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_Networks&diff=49733Graph Structure of Neural Networks2020-12-07T04:20:21Z<p>X277lu: /* Introduction */</p>
<hr />
<div>= Presented By =<br />
<br />
Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang<br />
<br />
= Introduction =<br />
<br />
A deep neural network is composed of neurons organized into layers and the connections between them. The architecture of a neural network can be captured by its "computational graph", where neurons are represented as nodes that perform mathematical computations, and directed edges that link neurons in different layers. This graphical representation demonstrates how the network transmits and transforms information through its input neurons through the hidden layers and ultimately to the output neurons. <br />
<br />
However, few researchers have focused on the relationship between neural networks and their predictive performance, which is the main focus of this paper. The aim is to help explain how the addition or deletion of layers, their links and the number of nodes can impact the performance of the network.<br />
<br />
In Neural Network research, it is often important to build a relation between a neural network’s accuracy and its underlying graph structure. It would contribute in building more efficient and more accurate neural network architectures. A natural choice is to use a computational graph representation. Doing so, however, has many limitations, including: <br />
<br />
(1) Lack of generality: Computational graphs have to follow allowed graph properties, which limits the use of the rich tools developed for general graphs. <br />
<br />
(2) Disconnection with biology/neuroscience: Biological neural networks have a much more complicated and less standardized structure. For example, there might be information exchanges in the brain networks. It is difficult to represent these models with directed acyclic graphs. This disconnection between biology/neuroscience makes knowledge less transferable and interdisciplinary research more difficult.<br />
<br />
Thus, the authors developed a new way of representing a neural network as a graph, called a relational graph. The key insight in the new representation is to focus on message exchange, rather than just on directed data flow. For example, for a fixed-width fully-connected layer, an input channel and output channel pair can be represented as a single node, while an edge in the relational graph can represent the message exchange between the two nodes. Under this formulation, using the appropriate message exchange definition, it can be shown that the relational graph can represent many types of neural network layers.<br />
<br />
WS-flex is a graph generator that allows systematically exploring the design space of neural networks. Neural networks are characterized by the clustering coefficient and average path length of their relational graphs under the insights of neuroscience. Based on the insights from neuroscience, the authors characterize neural networks by the clustering coefficient and average path lengths of their relational graphs.<br />
<br />
= Neural Network as Relational Graph =<br />
<br />
The author proposes the concept of relational graphs to study the graphical structure of neural network. Each relational graph is based on an undirected graph <math>G =(V; E)</math>, where <math>V =\{v_1,...,v_n\}</math> is the set of all the nodes, and <math>E \subseteq \{(v_i,v_j)|v_i,v_j\in V\}</math> is the set of all edges that connect nodes. Note that for the graph used here, all nodes have self edges, that is <math>(v_i,v_i)\in E</math>. Graphically, a self edge connects an edge to itself, which forms a loop.<br />
<br />
To build a relational graph that captures the message exchange between neurons in the network, we associate various mathematical quantities to the graph <math>G</math>. First, a feature quantity <math>x_v</math> is associated with each node. The quantity <math>x_v</math> might be a scalar, vector or tensor depending on what type of neural network it is (see the Table at the end of the section). Then a message function <math>f_{uv}(·)</math> is associated with every edge in the graph. A message function specifically takes a node’s feature as the input and then outputs a message. An aggregation function <math>{\rm AGG}_v(·)</math> then takes a set of messages (the outputs of the message function) and outputs the updated node feature. <br />
<br />
A relational graph is a graph <math>G</math> associated with several message exchange rounds, which transform the feature quantity <math>x_v</math> with the message function <math>f_{uv}(·)</math> and the aggregation function <math>{\rm AGG}_v(·)</math>. At each round of message exchange, each node sends messages to its neighbors and aggregates incoming messages from its neighbors. Each message is transformed at each edge through the message function, then they are aggregated at each node via the aggregation function. Suppose we have already conducted <math>r-1</math> rounds of message exchange, then the <math>r^{th}</math> round of message exchange for a node <math>v</math> can be described as<br />
<br />
<div style="text-align:center;"><math>\mathbf{x}_v^{(r+1)}= {\rm AGG}^{(r)}(\{f_v^{(r)}(\textbf{x}_u^{(r)}), \forall u\in N(v)\})</math></div> <br />
<br />
where <math>\mathbf{x}^{(r+1)}</math> is the feature of the <math>v</math> node in the relational graph after the <math>r^{th}</math> round of update. <math>u,v</math> are nodes in Graph <math>G</math>. <math>N(u)=\{u|(u,v)\in E\}</math> is the set of all the neighbor nodes of <math>u</math> in graph <math>G</math>.<br />
<br />
To further illustrate the above, we use the basic Multilayer Perceptron (MLP) as an example. An MLP consists of layers of neurons, where each neuron performs a weighted sum over scalar inputs and outputs, followed by some non-linearity. Suppose the <math>r^{th}</math> layer of an MLP takes <math>x^{(r)}</math> as input and <math>x^{(r+1)}</math> as output, then a neuron computes <br />
<br />
<div style="text-align:center;"><math>x_i^{(r+1)}= \sigma(\Sigma_jw_{ij}^{(r)}x_j^{(r)})</math>.</div> <br />
<br />
where <math>w_{ij}^{(r)}</math> is the trainable weight and <math>\sigma</math> is the non-linearity function. Let's first consider the special case where the input and output of all the layers <math>x^{(r)}</math>, <math>1 \leq r \leq R </math> have the same feature dimensions <math>d</math>. In this scenario, we can have <math>d</math> nodes in the Graph <math>G</math> with each node representing a neuron in MLP. Each layer of neural network will correspond with a round of message exchange, so there will be <math>R</math> rounds of message exchange in total. The aggregation function here will be the summation with non-linearity transform <math>\sigma(\Sigma)</math>, while the message function is simply the scalar multipication with weight. A fully-connected, fixed-width MLP layer can then be expressed with a complete relational graph, where each node <math>x_v</math> connects to all the other nodes in <math>G</math>, that is neighborhood set <math>N(v) = V</math> for each node <math>v</math>. The figure below shows the correspondence between the complete relation graph with a 5-layer 4-dimension fully-connected MLP.<br />
<br />
<div style="text-align:center;">[[File:fully_connnected_MLP.png]]</div><br />
<br />
In fact, a fixed-width fully-connected MLP is only a special case under a much more general model family, where the message function, aggregation function, and most importantly, the relation graph structure can vary. The different relational graph will represent the different topological structure and information exchange pattern of the network, which is the property that the paper wants to examine. The plot below shows two examples of non-fully connected fixed-width MLP and their corresponding relational graphs. <br />
<br />
<div style="text-align:center;">[[File:otherMLP.png]]</div><br />
<br />
We can generalize the above definitions for fixed-width MLP to Variable-width MLP, Convolutional Neural Network (CNN), and other modern network architecture like Resnet by allowing the node feature quantity <math>\textbf{x}_j^{(r)}</math> to be a vector or tensor respectively. In this case, each node in the relational graph will represent multiple neurons in the network, and the number of neurons contained in each node at each round of message exchange does not need to be the same, which gives us a flexible representation of different neural network architecture. The message function will then change from the simple scalar multiplication to either matrix/tensor multiplication or convolution. And the weight matrix will change to the convolutional filter. The representation of these more complicated networks are described in details in the paper, and the correspondence between different networks and their relational graph properties is summarized in the table below. <br />
<br />
<div style="text-align:center;">[[File:relational_specification.png]]</div><br />
<br />
Overall, relational graphs provide a general representation for neural networks. With proper definitions of node features and message exchange, relational graphs can represent diverse neural architectures, thereby allowing us to study the performance of different graph structures.<br />
<br />
= Exploring and Generating Relational Graphs=<br />
<br />
We will deal with the design and how to explore the space of relational graphs in this section. There are three parts we need to consider:<br />
<br />
(1) '''Graph measures''' that characterize graph structural properties:<br />
<br />
We will use one global graph measure, average path length, and one local graph measure, clustering coefficient.<br />
To explain clearly, average path length measures the average shortest path distance between any pair of nodes. The clustering coefficient measures the proportion of edges between the nodes within a given node’s neighborhood, divided by the number of edges that could possibly exist between them, averaged over all nodes.<br />
<br />
(2) '''Graph generators''' that can generate the diverse graph:<br />
<br />
With selected graph measures, we use a graph generator to generate diverse graphs to cover a large span of graph measures. To figure out the limitation of the graph generator and find out the best, we investigate some generators including ER, WS, BA, Harary, Ring, Complete graph and results are shown below:<br />
<br />
<div style="text-align:center;">[[File:3.2 graph generator.png]]</div><br />
<br />
Thus, from the picture, we could obtain the WS-flex graph generator that can generate graphs with a wide coverage of graph measures. Notably, WS-flex graphs almost encompass all the graphs generated by classic random generators mentioned above.<br />
<br />
(3) '''Computational Budget''' that we need to control so that the differences in performance of different neural networks are due to their diverse relational graph structures.<br />
<br />
It is important to ensure that all networks have approximately the same complexities so that the differences in performance are due to their relational graph structures when comparing neutral work by their diverse graph.<br />
<br />
We use floating point operations (FLOPS) which measures the number of multiply-adds as the metric. We first compute the FLOPS of our baseline network instantiations (i.e. complete relational graph) and use them as the reference complexity in each experiment. From the description in section 2, a relational graph structure can be instantiated as a neural network with variable width. Therefore, we can adjust the width of a neural network to match the reference complexity without changing the relational graph structures.<br />
<br />
= Experimental Setup =<br />
The author studied the performance of 3942 sampled relational graphs (generated by WS-flex from the last section) of 64 nodes with two experiments: <br />
<br />
(1) CIFAR-10 dataset: 10 classes, 50K training images, and 10K validation images<br />
<br />
Relational Graph: all 3942 sampled relational graphs of 64 nodes<br />
<br />
Studied Network: 5-layer MLP with 512 hidden units<br />
<br />
The model was trained for 200 epochs with batch size 128, using cosine learning rate schedule with an initial learning rate of 0.1<br />
<br />
<br />
(2) ImageNet classification: 1K image classes, 1.28M training images and 50K validation images<br />
<br />
Relational Graph: Due to high computational cost, 52 graphs are uniformly sampled from the 3942 available graphs.<br />
<br />
Studied Network: <br />
*ResNet-34, which only consists of basic blocks of 3×3 convolutions (He et al., 2016)<br />
<br />
*ResNet-34-sep, a variant where we replace all 3×3 dense convolutions in ResNet-34 with 3×3 separable convolutions (Chollet, 2017)<br />
<br />
*ResNet-50, which consists of bottleneck blocks (He et al., 2016) of 1×1, 3×3, 1×1 convolutions<br />
<br />
*EfficientNet-B0 architecture (Tan & Le, 2019)<br />
<br />
*8-layer CNN with 3×3 convolution<br />
<br />
= Results and Discussions =<br />
<br />
The paper summarizes the result of the experiment among multiple different relational graphs through sampling and analyzing and list six important observations during the experiments, These are:<br />
<br />
* There always exists a graph structure that has higher predictive accuracy under Top-1 error compared to the complete graph<br />
<br />
* There is a sweet spot such that the graph structure near the sweet spot usually outperforms the base graph<br />
<br />
* The predictive accuracy under top-1 error can be represented by a smooth function of Average Path Length <math> (L) </math> and Clustering Coefficient <math> (C) </math><br />
<br />
* The Experiments are consistent across multiple datasets and multiple graph structures with similar Average Path Length and Clustering Coefficient.<br />
<br />
* The best graph structure can be identified easily.<br />
<br />
* There are similarities between the best artificial neurons and biological neurons.<br />
<br />
<br />
<br />
[[File:Result2_441_2020Group16.png |1000px]]<br />
<br />
$$\text{Figure - Results from Experiments}$$<br />
<br />
<br />
'''Neural networks performance depends on its structure'''<br />
<br />
During the experiment, Top-1 errors for all sampled relational graph among multiple tasks and graph structures are recorded. The parameters of the models are average path length and clustering coefficient. Heat maps were created to illustrate the difference in predictive performance among possible average path length and clustering coefficient. In '''Figure - Results from Experiments (a)(c)(f)''', The darker area represents a smaller top-1 error which indicates the model performs better than the light area.<br />
<br />
Compared to the complete graph which has parameter <math> L = 1 </math> and <math> C = 1 </math>, the best performing relational graph can outperform the complete graph baseline by 1.4% top-1 error for MLP on CIFAR-10, and 0.5% to 1.2% for models on ImageNet. Hence it is an indicator that the predictive performance of the neural networks highly depends on the graph structure, or equivalently that the completed graph does not always have the best performance.<br />
<br />
<br />
'''Sweet spot where performance is significantly improved'''<br />
<br />
It had been recognized that training noises often results in inconsistent predictive results. In the paper, the 3942 graphs in the sample had been grouped into 52 bins, each bin had been colored based on the average performance of graphs that fall into the bin. By taking the average, the training noises had been significantly reduced. Based on the heat map '''Figure - Results from Experiments (f)''', the well-performing graphs tend to cluster into a special spot that the paper called “sweet spot” shown in the red rectangle, the rectangle is approximately included clustering coefficient in the range <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math>.<br />
<br />
<br />
'''Relationship between neural network’s performance and parameters'''<br />
<br />
When we visualize the heat map, we can see that there is no significant jump of performance that occurred as a small change of clustering coefficient and average path length ('''Figure - Results from Experiments (a)(c)(f)'''). In addition, if one of the variables is fixed in a small range, it is observed that a second-degree polynomial can be used to fit and approximate the data sufficiently well ('''Figure - Results from Experiments (b)(d)'''). Therefore, both the clustering coefficient and average path length are highly related to neural network performance by a convex U-shape. <br />
<br />
<br />
'''Consistency among many different tasks and datasets'''<br />
<br />
They observe that relational graphs with certain graph measures may consistently perform well regardless of how they are instantiated. The paper presents consistency uses two perspectives, one is qualitative consistency and another one is quantitative consistency.<br />
<br />
(1) '''Qualitative Consistency'''<br />
It is observed that the results are consistent from different points of view. Among multiple architecture dataset, it is observed that the clustering coefficient within <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math> consistently outperform the baseline complete graph. <br />
<br />
(2) '''Quantitative Consistency'''<br />
Among different dataset with the network that has similar clustering coefficient and average path length, the results are correlated, The paper mentioned that ResNet-34 is much more complex than 5-layer MLP but a fixed set relational graph would perform similarly in both settings, with Pearson correlation of <math>0.658</math>, the p-value for the Null hypothesis is less than <math>10^{-8}</math>.<br />
<br />
<br />
'''Top architectures can be identified efficiently'''<br />
<br />
The computation cost of finding top architectures can be significantly reduced without training the entire data set for a large value of epoch or a relatively large sample. To achieve the top architectures, the number of graphs and training epochs need to be identified. For the number of graphs, a heatmap is a great tool to demonstrate the result. In the 5-layer MLP on CIFAR-10 example, taking a sample of the data around 52 graphs would have a correlation of 0.9, which indicates that fewer samples are needed for a similar analysis in practice. When determining the number of epochs, correlation can help to show the result. In ResNet34 on ImageNet example, the correlation between the variables is already high enough for future computation within 3 epochs. This means that good relational graphs perform well even at the<br />
initial training epochs.<br />
<br />
<br />
'''Well-performing neural networks have graph structure surprisingly similar to those of real biological neural networks'''<br />
<br />
The way we define relational graphs and average length in the graph is similar to the way information is exchanged in network science. The biological neural network also has a similar relational graph representation and graph measure with the best-performing relational graph.<br />
<br />
While there is some organizational similarity between a computational neural network and a biological neural network, we should refrain from saying that both these networks share many similarities or are essentially the same with just different substrates. The biological neurons are still quite poorly understood and it may take a while before their mechanisms are better understood.<br />
<br />
= Conclusions=<br />
Our works provided a new viewpoint about combining the fields of graph neural networks (GNNs) and general architecture design by establishing graph structures. The authors believe that the model helps to capture the diverse neural network architectures under a unified framework. In particular, GNNs are instances of general neural architectures where graph structures are seen as input rather than part of the architecture and message functions are shared across all edges of the graph. We found that the graph theories and techniques implemented in other disciplines have the capability to provide a good reference for understanding and designing the structures and functions of neural networks. This could provide effective help and inspiration for the study of neural networks.<br />
<br />
= Critique =<br />
<br />
1. The experiment is only measuring on a single data set which might not be representative enough. As we can see in the whole paper, the "sweet spot" we talked about might be a special feature for the given data set only which is the CIFAR-10 data set. If we change the data set to another imaging data set like CK+, whether we are going to get a similar result is not shown by the paper. Hence, the result that is being concluded from the paper might not be representative enough. <br />
<br />
2. When we are fitting the model in practice, we will fit the model with more than one epoch. The order of the model fitting should be randomized since we should create more random jumps to avoid staying inside a local minimum. With the same order within each epoch, the data might be grouped by different classes or levels, the model might result in a better performance with certain classes and worse performance with other classes. In this particular example, without randomization of the training data, the conclusion might not be precise enough.<br />
<br />
3. This study shows empirical justification for choosing well-performing models from graphs differing only by average path length and clustering coefficient. An equally important question is whether there is a theoretical justification for why these graph properties may (or may not) contribute to the performance of a general classifier - for example, is there a combination of these properties that is sufficient to recover the universality theorem for MLP's?<br />
<br />
4. It might be worth looking into how to identify the "sweet spot" for different datasets.<br />
<br />
5. What would be considered a "best graph structure " in the discussion and conclusion part? It seems that the intermediate result of getting an accurate result was by binning graphs into smaller bins, what should we do if the graphs can not be binned into significantly smaller bins in order to proceed with the methodologies mentioned in the paper. Both CIFAR - 10 and ImageNet seem too trivial considering the amount of variation and categories in the dataset. What would the generalizability be to other presentations of images?<br />
<br />
6. There is an interesting insight that the idea of the relational graph is similar to applying causal graphs in neuro networks, which is also closer to biology and neuroscience because human beings learning things based on causality. This new approach may lead to higher prediction accuracy but it needs more assumptions, such as correct relations and causalities.<br />
<br />
7. This is an interesting topic that uses the knowledge in graph theory to introduce this new structure of Neural Networks. Using more data sets to discuss this approach might be more interesting, such as the MNIST dataset. We think it is interesting to discuss whether this structure will provide a better performance compare to the "traditional" structure of NN in any type of Neural Networks.<br />
<br />
8. The key idea is to treat a neural network as a relational graph and investigate the messages exchange over the graphs. It will be interesting to discuss how much improvement a sweet spot can bring to the network.<br />
<br />
9. It is interesting to see how figure "b" in the plots showing the experiment results has a U-shape. This shows a correlation between message exchange efficiency and capability of learning distributed representations, giving an idea about the tradeoff between them, as indicated in the paper.<br />
<br />
10. From Results and Discussion part, we can find learning at a very abstract level of artificial neurons and biological neurons is similar. But the learning method is different. Artificial neural networks use gradient descent to minimize the loss function and reach the global minimum. Biological neural networks have another learning method.<br />
<br />
11. It would be interesting to see if the insights from this paper can be used to create new kinds of neural network architectures which were previously unknown, or to create a method of deciding which neural network to use for a given situation or data set. It may be interesting to compare the performance of neural networks by a property of their graph (like average path length as described in this paper) versus a property of the data (such as noisiness).<br />
<br />
12. It would be attractive if the paper dig more into the “sweet pot” and give us a brief introduction of it rather than just slightly mention it. It would be a little bit confused if this technical term shows up and without any previous mention or introduction<br />
<br />
13. The paper shows graphical results how well performing graphs cluster into ‘sweet spots’. And the ‘sweet spot’ leads to significant performance improvement of the graph. It would be interesting if the author could provide more justification for the experimental setup chosen and also explore if known graph theorems have any relation to this improved performance.<br />
<br />
14. GNN is really interesting, one application I can think of is it can be used for recommending system like Linkedin and Facebook. Since people who knows each other is really easy represented as graph.<br />
<br />
15. At the end of the paper, it could also talk about the applications of relational graph, the developed graph-based representation of neural networks. For example, structural scenarios that data has relational structure like atoms and molecules, and non-structural scenarios that data doesn’t have relational structure like images and texts.</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=49729Task Understanding from Confusing Multi-task Data2020-12-07T04:15:33Z<p>X277lu: /* Introduction */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, assisting doctors to make data-driven decisions, and even self-driving cars. One of the most famous integrated forms of Narrow AI is Apple's Siri. Siri has no self-awareness or genuine intelligence, and hence often has challenges performing tasks outside its range of abilities. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. For an isolated and very difficult task, the artificial intelligence may not learn it very well. For instance, a net with pixel dimension 1000*1000 is less likely to identify complicated objects in real-world situations on the time basis. However, if it could be learned simultaneously, it would be better as the tasks can share what they learned. It is easier for the learner to learn together instead of in isolation, for example, shapes, landmarks, textures, orientation and so on. This is called Multitask Learning. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input values. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes two functions: de-confusing function and mapping function. The de-confusing function allocates samples to respective tasks and the mapping function presents the relation from the input to its label within the allocated tasks. See figure 1(b). To implement the CSL, we use a risk functional to balance the effects of the de-confusing function and mapping function. <br />
<br />
However, simply combining the two functions or networks to a single architecture is impossible, since the one-hot constraint of the outputs for the de-confusing network makes the gradient back-propagation unfeasible. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization in the proposed architecture CLS-Net.<br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine assigned with complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, we know that samples are generated from multiple distinct distributions instead of one distribution combining a mixture of multiple probability models. Thus, the latent variable learning can not fully distinguish labels into different tasks and different distributions, and it is insufficient to classify the multi-task confusing samples. <br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improving the overall learning efficiency since the labels in different tasks are often correlated: improving the classification result for one class also helps with other classification tasks. In multi-task learning, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. If such manual task annotation is absent, then the algorithm can not be performed.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, let <math> (x,y)</math> be the training samples from <math>y=f(x)</math>, which is an identical but unknown mapping relationship. Assuming the risk measure is mean squared error (MSE), the expected risk function is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the data distribution of the input variable <math>x</math>. In practice, the methods select the optimal function by minimizing the empirical risk:<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
To minimize the risk function, the theoretically optimal solution is <math> f(x) </math>.<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk function can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math>. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. The solution represents a mixed probably model instead of knowing the exact tasks and their correpsonding individual probability distribution. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to the output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (using MSE loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
The risk metric of every sample affects only its assigned task.<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
However, necessity constraints are needed to avoid meaningless trivial solutions in all optimal risk solutions.<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
This theorem shows the method of empirical risk minimization is valid in the CSL framework. Moreover, the assumed number of tasks affects the VC dimension of the learning functions, which is positively related to the generalization error. Therefore, to make the training risk small, we need to choose the ''minimum number'' of tasks when determining the task.<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL, including the stucture of CSL-Net and Iterative deconfusing algorithm.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to reproduce both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of digit data in a range of 0 to 9, each of which is in a single color among the eight different colors. Each observation in this modified set consists of a colored image (<math>x_i</math>) and a label (<math>y_i</math>) that represents either the corresponding color, or the digit. The goal is to reproduce the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': The second image classification data set consists of several fashion-related objects labeled from any of the 3 criteria: “gender”, “category”, and “main color”, whose number of observations is larger than that of the "colored-MNIST" data set.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks. The CSL methods autonomously learned three tasks which corresponded exactly to “Gender”,<br />
“Category”, and “Color” as we see it.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
The purpose of this measure arises from the fact that, in addition to learning mapping allocations like humans, machines should be able to approximate all mapping functions accurately in order to provide corresponding labels. The Label Prediction Accuracy measure captures the exchange equivalence of the following task: each mapping contains its ground-truth output, and machines should be predicting the correct output that is close to the ground-truth. <br />
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': To "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well in learning the ground-truth. That means the initialization of the neural network is set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labelling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.<br />
<br />
Applications of multi-label classification include building a recommendation system, social media targeting, as well as detecting adverse drug reactions from the text.<br />
<br />
Multi-label can be used to improve the syndrome diagnosis of a patient by focusing on multiple syndromes instead of a single syndrome.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem without manual task annotations from basic input data. The model obtains a basic task concept by learning the minimum risk for confusing samples from differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, some limitations can be improved for future work:<br />
<br />
- The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process; <br />
<br />
- The current algorithm is difficult to learn basic features directly through a CNN structure and understand tasks simultaneously by training a full-connect network. However, this limitation does not affect the effectiveness of our algorithm in learning confusing data based on pre-trained features.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
Compared to the standard supervised learning, Multi-label learning can associate a training sample with multiple category tags at the same time. It can assign multiple labels to some hidden instances and can be reduced to standard supervised learning by limiting the number of class labels per instance. <br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed to converge because of the strong correlation in prediction between the two networks.<br />
<br />
It would be interesting to see this implemented in more examples, to test the robustness of different types of data. The validation tasks chosen by data are all very simple, and CSL is actually not necessary for those tasks. For the colored MNIST data, a simple function can be written to distinguish the color label from the number label. The same problem applied to the Kaggle Fashion product dataset. The candidate label can be easily classified into different tasks by some wording analysis or meaning classification program or even manual classification. Even though the idea discussed by authors are interesting, the examples suggested by authors seem to suggest very limited or even unnecessary application. In most cases, it is more beneficial to treat the Confusing Multi-task Data problems separately into two distinct stages: we classify the tasks first according to the meaning of the label, and then we perform a multi-class/multi-label training process.<br />
<br />
When using this framework for classification, the order of the one-hot classification labels for each task will likely influence the relationships learned between each task, since the same output header is used for all tasks. This may be why this method fails to learn low-level representations and requires pretraining. I would like to see more explanation in the paper about why this isn't a problem if it was investigated.<br />
<br />
It would be a good idea to include comparison details in the summary to make the results and the conclusion more convincing. For instance, though the paper introduced the result generated using confusion data, and provide some applications for multi-label learning, these two sections still fell short and could use some technical details as supporting evidence.<br />
<br />
It is interesting to investigate if the order of adding tasks will influence the model performance.<br />
<br />
It would be interesting to see the effectiveness of applying CSL in face recognition, such that not only does the algorithm map the face to identity, it also categorizes the face based on other features like beard/no beard and glasses/no glasses simultaneously.<br />
<br />
It would be better for the researchers to compare the efficiency of this approach with other models.<br />
<br />
For pattern recognition,pre-trained features were used in the algorithm. It would be interesting to see how the effectiveness of the model changes if we train it with data directly from the CNN structure in the future.<br />
<br />
Basically given a confused dataset CSL finds the important tasks or labels from the dataset as can be seen from the fruit example. In the example, fruits are grouped under their names, their tastes, and their color, when CSL is given a mixed dataset. Hence given an unstructured data, unlabeled, confused dataset CSL helps in finding the labels, which in turn can help in cleaning the dataset and further in preparing high-quality training data set which is very important in different ML algorithms. Since at present preparing these dataset requires manual data annotations, CSL can save time in that process.<br />
<br />
For the Colorful-Mnist data set, the goal is to understand the concept of multiple classification tasks from these examples. All inputs have multiple classification tasks. Each observed sample only represents the classification result of one task, and the task from which the sample comes is unknown.<br />
<br />
It would be nice to know why the given metrics of confusing supervised learning are used. The authors should have used several different metrics and show that CSL's overall performs better than other methods. And what are "the other methods" referring to? algorithm<br />
<br />
In the Iterative Deconfusing algorithm section, the Training of Mapping-Net needs more explanation. The authors should specify what it is doing before showing its equations.<br />
<br />
For the results section, it would be more intuitive and stronger if the author provide more detail on these two methods and add a plot to support the claim. Based on the text, it might not be an obvious comparison.<br />
<br />
It will be interesting to see if this model can work for other datasets such as CIFAR-10, CIFAR-100, and ImageNet and how well it will perform on those datasets.<br />
<br />
Since this is the summary of the paper, it would be better if the introduction can concentrate more on multi-tasking and be trimmed a little by shorten the fruit example.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.<br />
<br />
[6] Guo-Ping Liu, Jian-Jun Yan, Yi-Qin Wang, Jing-Jing Fu, Zhao-Xia Xu, Rui Guo, Peng Qian, "Application of Multilabel Learning Using the Relevant Feature for Each Label in Chronic Gastritis Syndrome Diagnosis", Evidence-Based Complementary and Alternative Medicine, vol. 2012, Article ID 135387, 9 pages, 2012. https://doi.org/10.1155/2012/135387</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang&diff=49726User:T358wang2020-12-07T04:08:32Z<p>X277lu: /* Introduction */</p>
<hr />
<div><br />
== Group ==<br />
Rui Chen, Zeren Shen, Zihao Guo, Taohao Wang<br />
<br />
== Introduction ==<br />
<br />
Landmark recognition is an image retrieval task with unique applications which comes with its own specific challenges. It can be used in robotics to localize robots for navigation and also used by manipulators to detect different objects. This paper provides a new and effective method to recognize landmark images which can successfully identify statues, buildings, and other different characteristic objects. A relatable use-case is auto-sorting collections of photos to determine the location of the landmark, something we see occur in our smartphone's photo albums.<br />
<br />
There are many difficulties encountered in the development process:<br />
<br />
'''1.''' The concept of landmarks is not strictly defined. Landmarks can take various forms including objects and buildings.<br />
<br />
'''2.''' The same landmark can be photographed from different angles. Certain angles may capture the interior of a building as opposed to its exterior. This could result in vastly different picture characteristics between angles. A good model should accurately identify landmarks regardless of angles viewed.<br />
<br />
'''3.''' Often landmarks photographed at different times of the day look different based on the lighting and/or shift in shadows that the model needs to consider.<br />
<br />
'''4.''' The dataset is unbalanced. The majority of objects fall into the single class of "not landmarks", while relatively few images exist for each class of landmark. Hence, it is challenging to obtain both a low false-positive rate and a high recognition accuracy between classes of landmarks.<br />
<br />
There are also three potential problems:<br />
<br />
'''1.''' The processed data set contains a little error content; the image content is not clean, and the quantity is huge.<br />
<br />
'''2.''' The algorithm to learn the training set must be fast and scalable.<br />
<br />
'''3.''' While displaying high-quality judgment landmarks, there is no image geographic information mixed.<br />
<br />
The article describes the deep convolutional neural network (CNN) architecture, loss function, training method, and inference aspects. Using this model, similar metrics to the state of the art model in the test were obtained and the inference time was found to be 15 times faster. Furthermore, because of the efficient architecture, the system can be implemented in an online system. The results of quantitative experiments are displayed through testing and deployment effect analysis to prove the effectiveness of the model.<br />
<br />
== Related Work ==<br />
<br />
Landmark recognition can be regarded as one example of image retrieval. In the past two decades, a large number of studies have concentrated on image retrieval tasks, and the field has made significant progress. The methods adopted for image retrieval can be mainly divided into two categories. The first is a classic retrieval method using local features, a method based on local feature descriptors organized in bag-of-words, spatial verification, Hamming embedding, and query expansion. A bag of words model is defined as a simplified representation of the text information by retrieving only the significant words in a sentence or paragraph while disregarding its grammar. The bag of words approach is commonly used in classification tasks where the words are used as features in the model-training. These methods are dominant in image retrieval until the rise of deep convolutional neural networks (CNN), the second class of image retrieval methods, which are used to generate global descriptors of input images.<br />
<br />
Vector of Locally Aggregated Descriptors which uses first-order statistics of the local descriptor is one of the main representatives of local features-based methods. Another method is to selectively match the kernel Hamming embedding method extension. With the advent of deep convolutional neural networks, the most effective image retrieval method is based on training CNNs for specific tasks. Deep networks are very powerful for semantic feature representation, which allows us to effectively use them for landmark recognition. This method shows good results but brings additional memory and complexity costs. <br />
<br />
The DELF (DEep local feature) by Noh et al. proved promising results. This method combines the classic local feature method with deep learning. The dense features are extracted through fully convolutional network (FCN). To handle changes of scale, an image pyramid is constructed, and FCN is applied. Features are localized based on the receptive fields. <br />
<br />
[[File:delf.png | thumb | center | 1000px | The network architectures used for training (Noh et al., 2016)]]<br />
<br />
This allows us to extract local features from the input image and then use RANSAC for geometric verification. Random Sample Consensus (RANSAC) is a method to smooth data containing a significant percentage of errors, which is ideally suited for applications in automated image analysis where interpretation is based on the data generated by error-prone feature detectors. The goal of the project is to describe a method for accurate and fast large-scale landmark recognition using the advantages of deep convolutional neural networks.<br />
<br />
== Methodology ==<br />
<br />
This section will describe in detail the CNN architecture, loss function, training procedure, and inference implementation of the landmark recognition system. The figure below is an overview of the landmark recognition system.<br />
<br />
[[File:t358wang_landmark_recog_system.png |center|800px]]<br />
<br />
The landmark CNN consists of three parts: the main network, the embedding layer and the classification layer. To obtain a CNN main network suitable for training landmark recognition model, fine-tuning is applied and several pre-trained backbones (Residual Networks) based on other similar datasets, including ResNet-50, ResNet-200, SE-ResNext-101 and Wide Residual Network (WRN-50-2), are evaluated based on inference quality and efficiency, displayed in the table below. Residual networks are shallower architectures compared to usual deep networks and have their layers fit residuals mappings, <math> \mathcal{F}(x):= \mathcal{H}(x)-x </math>. The residual mappings are then recast into <math>\mathcal{F}(x)+x</math>. This is realized using shortcut connections to skip various layers and having the outputs added to the outputs of the stacked layers (He, Zhang, Ren, & Sun, 2016). Based on the evaluation results, WRN-50-2 is selected as the optimal backbone architecture. Fine-tuning is a very efficient technique in various computer vision applications because we can take advantage of everything the model has already learned and applied it to our specific task.<br />
<br />
[[File:t358wang_backbones.png |center|600px]]<br />
<br />
For the embedding layer, as shown in the below figure, the last fully-connected layer after the averaging pool is removed. Instead, a fully-connected 2048 <math>\times</math> 512 layer and a batch normalization layer is added as the embedding layer. After the batch normalization, a fully-connected 512 <math>\times</math> n layer is added as the classification layer. The below figure shows the overview of the CNN architecture of the landmark recognition system.<br />
<br />
[[File:t358wang_network_arch.png |center|800px]]<br />
<br />
To effectively determine the embedding vectors for each landmark class (centroids), the network needs to be trained to have the members of each class to be as close as possible to the centroids. Several suitable loss functions are evaluated including Contrastive Loss, Arcface, and Center loss. The center loss is selected since it achieves the optimal test results and it trains a center of embeddings of each class and penalizes distances between image embeddings as well as their class centers. In addition, the center loss is a simple addition to softmax loss and is trivial to implement.<br />
<br />
When implementing the loss function, a new additional class that includes all non-landmark instances needs to be added. The size of the non-landmark class is much greater than the classes of landmarks, and the non-landmark class would not have a specific structure because the non-landmark class contains all other objects. Hence the center loss function needs to be modified to focus on the landmark classes as followed. Let n be the number of landmark classes, m be the mini-batch size, <math>x_i \in R^d</math> is the i-th embedding and <math>y_i</math> is the corresponding label where <math>y_i \in</math> {1,...,n,n+1}, n+1 is the label of the non-landmark class. Denote <math>W \in R^{d \times n}</math> as the weights of the classifier layer, <math>W_j</math> as its j-th column. Let <math>c_{y_i}</math> be the <math>y_i</math> th embeddings center from Center loss and <math>\lambda</math> be the balancing parameter of Center loss. Then the final loss function will be: <br />
<br />
[[File:t358wang_loss_function.png |center|600px]]<br />
<br />
In the training procedure, the stochastic gradient descent(SGD) will be used as the optimizer with momentum=0.9 and weight decay = 5e-3. For the center loss function, the parameter <math>\lambda</math> is set to 5e-5. Each image is resized to 256 <math>\times</math> 256 and several data augmentations are applied to the dataset including random resized crop, color jitter, and random flip. The training dataset is divided into four parts based on the geographical affiliation of cities where landmarks are located: Europe/Russia, North America/Australia/Oceania, the Middle East/North Africa, and the Far East Regions. <br />
<br />
The paper introduces curriculum learning for landmark recognition, which is shown in the below figure. The algorithm is trained for 30 epochs and the learning rate <math>\alpha_1, \alpha_2, \alpha_3</math> will be reduced by a factor of 10 at the 12th epoch and 24th epoch.<br />
<br />
[[File:t358wang_algorithm1.png |center|600px]]<br />
<br />
In the inference phase, the paper introduces the term “centroids” which are embedding vectors that are calculated by averaging embeddings and are used to describe landmark classes. The calculation of centroids is significant to effectively determine whether a query image contains a landmark. The paper proposes two approaches to help the inference algorithm to calculate the centroids. First, instead of using the entire training data for each landmark, data cleaning is done to remove most of the redundant and irrelevant elements in the image. For example, if the landmark we are interested in is a palace located on a city square, then images of a similar building on the same square are included in the data which can affect the centroids. Second, since each landmark can have different shooting angles, it is more efficient to calculate a separate centroid for each shooting angle. Hence, a hierarchical agglomerative clustering algorithm is proposed to partition training data into several valid clusters for each landmark and the set of centroids for a landmark L can be represented by <math>\mu_{l_j} = \frac{1}{|C_j|} \sum_{i \in C_j} x_i, j \in 1,...,v</math> where v is the number of valid clusters for landmark L and v=1 if there is no valid clusters for L. <br />
<br />
Once the centroids are calculated for each landmark class, the system can make decisions whether there is a landmark in an image. The query image is passed through the landmark CNN and the resulting embedding vector is compared with all centroids by dot product similarity using approximate k-nearest neighbors (AKNN). To distinguish landmark classes from non-landmark, a threshold <math>\eta</math> is set and it will be compared with the maximum similarity to determine if the image contains any landmarks.<br />
<br />
The full inference algorithm is described in the below figure.<br />
<br />
[[File:t358wang_algorithm2.png |center|600px]]<br />
<br />
We will now look at how the landmark database was created. The collection process was structured by countries, cities and landmarks. They divided the world into several regions: Europe, America, Middle East, Africa, Far East, Australia and Oceania. Within each region, cities were selected that contained a lot of significant landmarks, and some natural landmarks were filtered out as they are difficult to distinguish. These include landmarks such as parks and beaches, unless they're very notable and popular, for example Santa Monica Pier. Natural landmarks Once the cities and landmarks were selected, both images and metadata were collected for each landmark.<br />
<br />
[[File:landmarkcleaning.png | center | 400px]]<br />
<br />
After forming the database, it had to be cleaned before it could be used to train the CNN. First, for each landmark, any redundant images were removed. Then, for each landmark, 5 images were picked that had a high probability of containing the landmark and were checked manually. The database was then cleaned by parts using the curriculum learning process. For each landmark, the centroid was calculated from the references. The centroid is compared to each image of the current landmark data using dot product similarity, and if it is less than the currenbt threshold <math> \gamma</math> the image is rejected. It is further described in the pseudocode above. The final database contained 11381 landmarks in 503 cities and 70 countries. With 2331784 landmark images and 900,000 non-landmark images. The number of landmarks that have less than 100 images is called "rare".<br />
<br />
== Experiments and Analysis ==<br />
<br />
'''Offline test'''<br />
<br />
In order to measure the quality of the model, an offline test set was collected and manually labeled. According to the calculations, photos containing landmarks make up 1 − 3% of the total number of photos on average. This distribution was emulated in an offline test, and the geo-information and landmark references weren’t used. <br />
The results of this test are presented in the table below. Two metrics were used to measure the results of experiments: Sensitivity — the accuracy of a model on images with landmarks (also called Recall) and Specificity — the accuracy of a model on images without landmarks. Several types of DELF were evaluated, and the best results in terms of sensitivity and specificity were included in the table below. The table also contains the results of the model trained only with Softmax loss, Softmax, and Center loss. Thus, the table below reflects improvements in our approach with the addition of new elements in it.<br />
<br />
[[File:t358wang_models_eval.png |center|600px]]<br />
<br />
It's very important to understand how a model works on “rare” landmarks due to the small amount of data for them. Therefore, the behavior of the model was examined separately on “rare” and “frequent” landmarks in the table below. The column “Part from total number” shows what percentage of landmark examples in the offline test has the corresponding type of landmarks. And we find that the sensitivity of “frequent” landmarks is much higher than “rare” landmarks.<br />
<br />
[[File:t358wang_rare_freq.png |center|600px]]<br />
<br />
Analysis of the behavior of the model in different categories of landmarks in the offline test is presented in the table below. These results show that the model can successfully work with various categories of landmarks. Predictably better results (92% of sensitivity and 99.5% of specificity) could also be obtained when the offline test with geo-information was launched on the model.<br />
<br />
[[File:t358wang_landmark_category.png |center|600px]]<br />
<br />
'''Revisited Paris dataset'''<br />
<br />
Revisited Paris dataset (RPar)[2] was also used to measure the quality of the landmark recognition approach. This dataset with Revisited Oxford (ROxf) is standard benchmarks for the comparison of image retrieval algorithms. In recognition, it is important to determine the landmark, which is contained in the query image. Images of the same landmark can have different shooting angles or taken inside/outside the building. Thus, it is reasonable to measure the quality of the model in the standard and adapt it to the task settings. That means not all classes from queries are presented in the landmark dataset. For those images containing correct landmarks but taken from different shooting angles within the building, we transferred them to the “junk” category, which does not influence the final score and makes the test markup closer to our model’s goal. Results on RPar with and without distractors in medium and hard modes are presented in the table below. <br />
<br />
<div style="text-align:center;"> '''Revisited Paris Medium''' </div><br />
[[File:t358wang_methods_eval1.png |center|600px]]<br />
<br />
<br />
<div style="text-align:center;"> '''Revisited Paris Hard''' </div><br />
[[File:t358wang_methods_eval2.png |center|600px]]<br />
<br />
== Comparison ==<br />
<br />
Recent most efficient approaches to landmark recognition are built on fine-tuned CNN. We chose to compare our method to DELF on how well each performs on recognition tasks. A brief summary is given below:<br />
<br />
[[File:t358wang_comparison.png |center|600px]]<br />
<br />
''' Offline test and timing '''<br />
<br />
Both approaches obtained similar results for image retrieval in the offline test (shown in the sensitivity&specificity table), but the proposed approach is much faster on the inference stage and more memory efficient.<br />
<br />
To be more detailed, during the inference stage, DELF needs more forward passes through CNN, has to search the entire database, and performs the RANSAC method for geometric verification. All of them make it much more time-consuming than our proposed approach. Our approach mainly uses centroids, this makes it take less time and needs to store fewer elements.<br />
<br />
== Conclusion ==<br />
<br />
In this paper, we were hoping to solve some difficulties that emerge when trying to apply landmark recognition with a scalable approach to the production level, which is the kind of implementation that has been deployed to production at scale and used for recognition of user photos in the Mail.ru cloud application: there might not be a clean & sufficiently large database for interesting tasks, algorithms should be fast, scalable, and should aim for low FP and high accuracy.<br />
<br />
While aiming for these goals, we presented a way of cleaning landmark data. And most importantly, we introduced the usage of embeddings of deep CNN to make recognition fast and scalable, trained by curriculum learning techniques with modified versions of Center loss. Compared to the state-of-the-art methods, this approach shows similar results but is much faster and suitable for implementation on a large scale.<br />
<br />
== Critique ==<br />
The paper selected 5 images per landmark and checked them manually. That means the training process takes a long time on data cleaning and so the proposed algorithm lacks reusability. Also, since only the landmarks that are the largest and most popular were used to train the CNN, the trained model will probably be most useful in big cities instead of smaller cities with less popular landmarks.<br />
<br />
In addition, researchers often look for reliability and reproducibility. By using a private database and manually labeling, it lends itself to an array of issues in terms of validity and integrity. Researchers who are looking for such an algorithm will not be able to sufficiently determine if the experiments do actually yield the claimed results. Also, manual labeling by those who are related to the individuals conducting this research also raises the question of conflict of interest. The primary experiment of this paper should be on public and third-party datasets.<br />
<br />
CNN is very similar to ONN (ordinary Neural Networks), each neuron receives some inputs, and then performs a dot product and expresses a single differentiable score function. But what makes CNN interesting is that the neurons are with learnable weights and biases. In this case, it's much easier for us to make some analysis.<br />
<br />
It might be worth looking into the ability to generalize better, for example, can this model applied to other structures such as logos, symbols, etc. It would allow a different point of view of how the models perform well or not. <br />
<br />
This is a very interesting implementation in some specific field. The paper shows a process to analyze the problem and trains the model based on deep CNN implementation. In future work, it would be some practical advice to compare the deep CNN model with other models. By comparison, we might receive a more comprehensive training model for landmark recognition.<br />
<br />
This summary has a good structure and the methodology part is very clear for readers to understand. Using some diagrams for the comparison with other methods is good for visualization for readers. Since the dataset is marked manually, so it is kind of time-consuming for training a model. So it might be interesting to discuss how the famous IT company (i.e. Google etc.) fix this problem.<br />
<br />
It would be beneficial if the authors could provide more explanations regarding the DELF method. Visualization of the differences between DELF and CNN from an algorithm and architecture perspective would be highly significant for the context of this paper.<br />
<br />
One challenge of landmark recognition is a large number of classes. It would be good to see the comparison between the proposed model and other models in terms of efficiency.<br />
<br />
The scope of this paper seems to work specifically with some of the most well-known landmarks in the world, and many of these landmarks are well known because they are very distinct in how they look. It would be interesting to see how well the model works when classifying different landmarks of similar type (ie, Notre Dame Cathedral vs. St. Paul's Cathedral, etc.). It would also be interesting to see how this model compares with other models in the literature, or if this is unique, perhaps the authors could scale this model down to a landmark classification problem (castles, churches, parks, etc.) and compare it against other models that way.<br />
<br />
Paper 25 (Loss Function Search in Facial Recognition) also utilizes the softmax loss function in feature discrimination in images. The difference between this paper and paper 25 is that this paper focuses on landmark images, whereas paper 25 is on facial recognition. Despite the slightly different application, both papers prove the importance of using the softmax loss function in feature discrimination, which is pretty neat. Since both papers use softmax loss function for image recognition, it will be interesting to apply it to a dataset where both a landmark and human face are present, for example, photos on Instagram. Since Instagram users usually tag the landmark, supervised learning methods such as a multi-task learning model can be easily applied to the dataset. (See paper 14)<br />
<br />
The paper uses typical representatives of landmarks in various continents during training. It would be interesting to see how different results would be if the average structure of all landmarks present were used.<br />
<br />
The curriculum learning algorithm categories the landmarks into four different geographical locations, i.e (1) Europe, (2) North America, Australia and Oceania; (3) Middle East, North Africa; (4) Far east, because landmarks from the same geographical locations are similar to each other than from different geographical locations. Landmarks can also be categorized into archeological landmarks like medieval castles, architectural landmarks like monuments, biological landmarks like some parks or gardens, geological landmarks like caves etc, and it would be interesting to explore what kinds of results the curriculum learning algorithm would produce if the landmarks were categorized in different ways. Also In paper 17, i.e poker AI, uses an approach where similar decision points in the game were grouped together in a process called abstraction. Therefore a similar kind of approach of grouping objects is used in the methods described in both papers which indeed helps in reducing the computational time and makes the process more scalable.<br />
<br />
Given that the curriculum learning method in the paper has 4 categories that divide the data into continents, would adding more categories increase the accuracy of the algorithm? For instance, by using the continent information and meta-data from images, and even the lighting of the images, it may be possible to determine more detailed location data of the images. Also, what about landmarks that do not conform to the "style" of architecture in a continent? Are landmarks of "styles" that are less common in all of the continents more difficult to identify with this algorithm?<br />
<br />
Authors have outlined 3 potential problems but they did not explain how to fix them. It's better to specify how can we handle these problems otherwise we cannot apply this model to the real world problems. For example, the real world problems always have big training data and the content might not be clean.<br />
<br />
I got an impression that it works on a Large Scale Landmark Dataset from the title. However it does not mention anything about the process or proof that it can efficiently and accurately classify a large landmarks dataset. It was also only mentioned that the first dataset was "collected and manually labeled", but does not mention how it was collected. The second dataset was landmarks in Paris. If the author were to tune the model only according to these data, it is obvious that it can not be generalized well. It might be able to classify landmarks at where the images in dataset are taken or landmarks in Paris, but it might not work in other areas that has different culture, where the landmarks might look very different.<br />
<br />
It would be better that if the summary have an extended usage of the model. For instance, if this model can applied to any other aspects rather than landmark recognition. It will give us a more general understanding of the model.<br />
<br />
It is really important that all the data comes from one format, such as all comes from an image. Because sometimes it is really expensive to go scan the land structure in person, and the form of images is more common. So it is better that the author can focus on one form so that cleaner data can be captured.<br />
<br />
In the introduction, it was claimed that the deep CNN could be used in an online fashion due to its efficiency. It would be better if the author could provide a benchmark of existing methods and the model given in this paper in an online fashion to compare their accuracy and efficiency.<br />
<br />
== References ==<br />
[1] Andrei Boiarov and Eduard Tyantov. 2019. Large Scale Landmark Recognition via Deep Metric Learning. In The 28th ACM International Conference on Information and Knowledge Management (CIKM ’19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 10 pages. https://arxiv.org/pdf/1908.10192.pdf 3357384.3357956<br />
<br />
[2] FilipRadenović,AhmetIscen,GiorgosTolias,YannisAvrithis,andOndřejChum.<br />
2018. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking.<br />
arXiv preprint arXiv:1803.11285 (2018).<br />
<br />
[3] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-771. doi:10.1109/cvpr.2016.90</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=49717Adversarial Attacks on Copyright Detection Systems2020-12-07T03:59:20Z<p>X277lu: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==Introduction ==<br />
The copyright detection system is one of the most commonly used machine learning systems. Important real-world applications include tools such as Google Jigsaw which can identify and remove video promoting terrorism, or applications like YouTube where machine learning system is used to flag content that infringes copyrights. Failure to do so will result in legal consequences. However, the adversarial attacks on these systems have not been widely addressed by the public and remain largely unexplored. Adversarial attacks are instances where inputs are intentionally designed by people to cause misclassification in the model. <br />
<br />
Copyright detection systems are vulnerable to attacks for three reasons:<br />
<br />
1. Unlike physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is an open-set problem, which means that the uploaded files may not correspond to an existing class. In this case, it prevents users from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content with different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
The goal of this paper is to raise awareness of the security threats faced by copyright detection systems. In this paper, different types of copyright detection systems will be introduced. As an example, a widely used detection model from Shazam, a popular application used for recognizing music will be discussed in the paper. As a proof-of-concept, the paper generates audio fingerprints using convolutional neural networks and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then, the adversarial attacks are applied to the industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems.<br />
<br />
== Types of copyright detection systems ==<br />
Fingerprinting algorithms work by extracting the features of a source file as a hash and then utilizing a matching algorithm to compare that to the materials protected by copyright in the database. If enough matches are found between the source and existing data, then the two samples are considered identical. Thus, the copyright detection system is able to reject the copyright declaration of the source. Most audio, image, and video fingerprinting algorithms work by training a neural network to output features or extracting hand-crafted features.<br />
<br />
In terms of video fingerprinting, a useful algorithm is to detect the entering/leaving time of the objects in the video (Saviaga & Toxtli, 2018). The final hash consists of the entering/leaving of different objects and a unique relationship of the objects. However, most of these video fingerprinting algorithms only train their neural networks by using simple distortions such as adding noise or flipping the video rather than adversarial perturbations. This leads to algorithms that are strong against pre-defined distortions, but not adversarial attacks.<br />
<br />
Moreover, some plagiarism detection systems also depend on neural networks to generate a fingerprint of the input document. Though using deep feature representations as a fingerprint is efficient in detecting plagiarism, it still might be weak to adversarial attacks.<br />
<br />
Audio fingerprinting may perform better than the algorithms above. This is because the hash is usually generated by extracting hand-crafted features rather than training a neural network. That being said, it still is easy to attack.<br />
<br />
== Case study: evading audio fingerprinting ==<br />
<br />
=== Audio Fingerprinting Model===<br />
The audio fingerprinting model plays an important role in copyright detection. It is useful for quickly locating or finding similar samples inside an audio database. Shazam is a popular music recognition application, which uses one of the most well-known fingerprinting models. Because of the three properties: temporal locality, translation invariance, and robustness, Shazam's algorithm is treated as a good fingerprinting algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes. Spectrograms are two-dimensional representations of audio frequency spectra over time. An example is shown below. <br />
<br />
<div style="text-align:center;">[[File:Spectrogram-19thC.png|Spectrogram-19thC|390px]]</div><br />
<div align="center"><span style="font-size:80%">Source:https://commons.wikimedia.org/wiki/File:Spectrogram-19thC.png</span></div><br />
<br />
=== Interpreting the fingerprint extractor as a CNN ===<br />
The intention of this section is to build a differentiable neural network whose function resembles that of an audio fingerprinting algorithm, which is well-known for its ability to identify the meta-data, i.e. song names, artists, and albums, while independent of an audio format (Group et al., 2005). The generic neural network model will then be used as an example of black-box attacks on many popular real-world systems, in this case, YouTube and audio tag. <br />
<br />
The generic neural network model consists of two convolutional layers and a max-pooling layer, which is used for dimension reduction. This is depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporal locality and transformational invariance. In which, temporal locality refers to a program accessing the same memory location within a small time frame. This is in contrast to spatial locality, where the program accesses nearby memory locations within a small time frame (Pingali, 2011). The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending times of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
$$ f_{1}(n)=\frac {\sin^2(\frac{\pi n} {N})} {\sum_{i=0}^N \sin^2(\frac{\pi i}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where k <math>{\in}</math> 0,1,...,N-1 (output channel index) and n <math>{\in}</math> 0,1,...,N-1 (index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function to become robust to changes in the position of the feature. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
=== Formulating the adversarial loss function ===<br />
<br />
In the previous section, local maxima of the spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified to compare how similar two fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different (Hamming distance, 2020). For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks by knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function that involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak is in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(\text{ReLU}\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the outputs of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function. Note that <math>||\delta||_\infty = \max_k |\delta_k|</math> (the largest component of <math>\delta</math>) which makes it relatively easy to satisfy the above constraint by clipping the values of each iteration of <math>\delta</math> to be less than <math>\epsilon</math>. However, since <math>||\delta||_\infty \leq \dots \leq ||\delta||_2 \leq ||\delta||_1</math> this choice of norm results in the loosest bound, and therefore admits a larger selection of perturbations <math>\delta</math> than might be permitted by other choices of norms.<br />
<br />
=== Remix adversarial examples===<br />
While solving the optimization problem, the resulted example would be able to fool the copyright detection system. But it could sound unnatural with the perturbations.<br />
<br />
Instead, the fingerprinting could be made in a more natural way (i.e., a different audio signal). <br />
<br />
By modifying the loss function, which switches the order of the max-pooling layers in the smooth maximum components in the loss function, this remix loss function is to make two signals x and y look as similar as possible.<br />
<br />
$$J_{remix}(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_2}{\max}\phi(i+j;x)-\underset{|j| \leq w_1}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
<br />
By adding this new loss function, a new optimization problem could be defined. <br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x) + \lambda J_{remix}(x+\delta,y)\\<br />
s.t.||\delta||_{p}\le\epsilon<br />
$$<br />
<br />
where <math>{\lambda}</math> is a scalar parameter that controls the similarity of <math>{x+\delta}</math> and <math>{y}</math>.<br />
<br />
This optimization problem is able to generate an adversarial example from the selected source, and also enforce the adversarial example to be similar to another signal. The resulting adversarial example is called Remix adversarial example because it gets the references to its source signal and another signal.<br />
<br />
== Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. <math>{l_{\infty}}</math> norm and <math>{l_{2}}</math> norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks against our own proposed model is used to provide the baseline of adversarial examples’ effectiveness. Loss function <math>{J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|}</math> is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.<br />
<br />
[[File:Table_1_White-box.jpg |center ]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, by applying random perturbations to the audio recordings, AudioTag’s claim of being robust to input<br />
distortions was verified. However, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations. <br />
<br />
Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter <math>{\epsilon}</math> is required, which makes perturbations obvious although songs can still be recognized by humans. YouTube Content ID system has a more robust fingerprint model.<br />
<br />
<br />
[[File:Table_2_Black-box.jpg |center]]<br />
<br />
<div align="center">Table 2: Norms of the perturbations for black-box attacks</div><br />
<br />
[[File:YouTube_Figure.jpg |center]]<br />
<br />
<div align="center">Figure 2: YouTube’s copyright detection recall against the magnitude of noise</div><br />
<br />
== Conclusion ==<br />
In this paper, we obtain that many industrial copyright detection systems used in popular video and music websites, such as YouTube and AudioTag, are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using a neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. <br />
<br />
Although the method used in the paper isn't optimal, the attacks can be strengthened using sharper technical. Also, we could transfer attacks using rudimentary surrogate models that rely on hand-crafted features, while commercial systems likely rely on trainable neural nets.<br />
<br />
Our goal here is not to facilitate but the intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of copyright detection system. A number of mitigating approaches already exist, such as adversarial training, but they need to be further developed and examined in order to robustly protect against the threat of adversarial copyright attacks.<br />
<br />
== Appendix ==<br />
=== Feature Extraction Process in Audio-Fingerprinting System ===<br />
<br />
1. Preprocessing. In this step, the audio signal is digitalized and quantized at first.<br />
Then, it is converted to a mono signal by averaging two channels if necessary.<br />
Finally, it is resampled if the sampling rate is different from the target rate.<br />
<br />
2. Framing. Framing means dividing the audio signal into frames of equal length<br />
by a window function.<br />
<br />
3. Transformation. This step is designed to transform the set of frames into a new set<br />
of features, in order to reduce the redundancy. <br />
<br />
4. Feature Extraction. After transformation, final acoustic features are extracted<br />
from the time-frequency representation. The main purpose is to reduce the<br />
dimensionality and increase the robustness to distortions.<br />
<br />
5. Post-processing. To capture the temporal variations of the audio signal, higher<br />
order time derivatives are required sometimes.<br />
<br />
== Critiques ==<br />
- The experiments in this paper appear to be a proof-of-concept rather than a serious evaluation of a model. One problem is that the norm is used to evaluate the perturbation. Unlike the norm in image domains which can be visualized and easily understood, the perturbations in the audio domain are more difficult to comprehend. A cognitive study or something like a user study might need to be conducted in order to understand this. Another question related to this is that if the random noise is 2x bigger or 3x bigger in terms of the norm, does this make a huge difference when listening to it? Are these two perturbations both very obvious or unnoticeable? In addition, it seems that a dataset is built but the stats are missing. Third, no baseline methods are being compared to in this paper, not even an ablation study. The proposed two methods (default and remix) seem to perform similarly.<br />
<br />
- There could be an improvement in term of how to find the threshold in general, it mentioned how to measure the similarity of two pieces of content but have not discussed what threshold should we set for this model. In fact, it is always a challenge to determine the boundary of "Copyright Issue" or "Not Copyright Issue" and this is some important information that may be discussed in the paper.<br />
<br />
- The fingerprinting technique used in this paper seems rather elementary, which is a downfall in this context because the focus of this paper is adversarial attacks on these methods. A recent 2019 work (https://arxiv.org/pdf/1907.12956.pdf) proposed a deep fingerprinting algorithm along with some novel framing of the problem. There are several other older works in this area that also give useful insights that would have improved the algorithm in this paper.<br />
<br />
- Figure 1 clearly indicates the stricture of an audio fingerprinting model with 2 convolution layers and a max pooling layer. In the following paragraphs, it shows how and why the author choose to use each layer and what we could get at the output layer. This gives us a general thought of how this type of neural network is used to deal with data.<br />
<br />
- In the experiment section, authors could provide more background information on the dataset used. For example, the number of songs in the dataset and a brief introduction of different features.<br />
<br />
- Since the paper didn't go into details of how features are extracted in an audio-fingerprinting system, the details are listed out above in "Feature Extraction Process in Audio-Fingerprinting System"<br />
<br />
- In Shazam algorithm, the author should explore more on when the system will detect the copyright issue. Such as should the copyright issue be raised if there is only 5 seconds of melody or lyrics that are similar to other music?<br />
<br />
- In addition to Audio files, can this system detect copyright attacks on mixed data such as say, embedded audio files in the text? It will be interesting to see if it can do it. Nowadays, a lot of data is coming in mixed form, e.g. text+video, audio+text etc and therefore having a system to detect adversarial attacks on copyright detection systems for mixed data will be a useful development.<br />
<br />
- When introducing the types of copyright detection systems, there is a lack of clearness in explaining fingerprinting algorithms for audios and images. Besides a brief, one-line explanation, justifications regarding why audio fingerprinting algorithms are better compared to video fingerprinting algorithms were not provided. Moreover, discussions regarding image fingerprinting algorithms were missing. Hence, itt extent of the would be beneficial to add these two sections to enhance the readers' understanding of this summary.<br />
<br />
- It says: "Fingerprinting algorithms work by extracting the features of a source file as a hash", but it seems like they should have included more information about it so that readers could understand how feature extraction works in this case.<br />
<br />
- It would be better if explanations on all algorithms and terminologies are included, maybe just point somewhere contains such information to keep this summary short and clear.<br />
<br />
- The fingerprint algorithm in this article is the only algorithm discussed in the paper, the author should discuss the evolution of the algorithm and add some comparisons of with other algorithms that currently exist.<br />
<br />
== References ==<br />
<br />
Group, P., Cano, P., Group, M., Group, E., Batlle, E., Ton Kalker Philips Research Laboratories Eindhoven, . . . Authors: Pedro Cano Music Technology Group. (2005, November 01). A Review of Audio Fingerprinting. Retrieved November 13, 2020, from https://dl.acm.org/doi/10.1007/s11265-005-4151-3<br />
<br />
Hamming distance. (2020, November 1). In ''Wikipedia''. https://en.wikipedia.org/wiki/Hamming_distance<br />
<br />
Jovanovic. (2015, February 2). ''How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing''. Toptal Engineering Blog. https://www.toptal.com/algorithms/shazam-it-music-processing-fingerprinting-and-recognition<br />
<br />
Saadatpanah, P., Shafahi, A., &amp; Goldstein, T. (2019, June 17). ''Adversarial attacks on copyright detection systems''. Retrieved November 13, 2020, from https://arxiv.org/abs/1906.07153.<br />
<br />
Saviaga, C. and Toxtli, C. ''Deepiracy: Video piracy detection system by using longest common subsequence and deep learning'', 2018. https://medium.com/hciwvu/piracy-detection-using-longestcommon-subsequence-and-neuralnetworks-a6f689a541a6<br />
<br />
Wang, A. et al. ''An industrial strength audio search algorithm''. In Ismir, volume 2003, pp. 7–13. Washington, DC, 2003.<br />
<br />
Pingali, K. (2011). Locality of Reference and Parallel Processing. Encyclopedia of Parallel Computing, 21. doi:https://doi.org/10.1007/978-0-387-09766-4_206</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=what_game_are_we_playing&diff=49714what game are we playing2020-12-07T03:55:23Z<p>X277lu: /* Introduction */</p>
<hr />
<div>== Authors == <br />
Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran <br />
<br />
== Introduction ==<br />
Recently, there have been intensive studies of methods using AI to search for the optimal solution of large-scale, zero-sum and extensive form problems. However, most of these works operate under the assumption that the parameters of the game are known, while this scenario is ideal. In real-world problems, most of the parameters are unknown prior to the game. This paper proposes a framework to find an optimal solution using a primal-dual Newton Method and then using back-propagation to analytically compute the gradients of all the relevant game parameters.<br />
<br />
The approach to solving this problem is to consider ''quantal response equilibrium'' (QRE), which is a generalization of Nash equilibrium (NE) where the agents can make suboptimal decisions. It is shown that the solution to the QRE is a differentiable function of the payoff matrix. Consequently, back-propagation can be used to analytically solve for the payoff matrix (or other game parameters). This strategy has many future application areas as it allows for game-solving (both extensive and normal form) to be integrated as a module in a deep neural network.<br />
<br />
An example of architecture is presented below:<br />
<br />
[[File:Framework.png ]]<br />
<br />
Payoff matrix <math> P </math> is parameterized by a domain-dependent low dimensional vector <math> \phi </math>, where <math> \phi </math> depends on a differentiable function <math> M_1(x) </math>. Furthermore, <math> P </math> is applied to QRE to get the equilibrium strategies <math> (u^∗, v^∗) </math>. Lastly, the loss function is calculated after applying any differentiable <math> M_2(u^∗, v^∗) </math>.<br />
<br />
The effectiveness of this model is demonstrated using the games “Rock, Paper, Scissors”, one-card poker, and a security defense game. This could represent a substantial step forward in understanding how game-theoretic methods can be applied to uncertain settings by the ability to learn solely from observed actions and the relevant parameters.<br />
<br />
== Learning and Quantal Response in Normal Form Games ==<br />
<br />
The game-solving module provides all elements required in differentiable learning, which maps contextual features to payoff matrices, and computes equilibrium strategies under a set of contextual features. This paper will learn zero-sum games and start with normal form games since they have game solver and learning approach capturing much of intuition and basic methodology.<br />
<br />
=== Zero-Sum Normal Form Games ===<br />
<br />
A normal form game is defined as a game with a finite number of players, <math>N</math>, where each player has a non-empty and finite action space, <math>A_i</math>. The outcome of the game is defined by the action profile <math>a = (a_1,...,a_n)</math>, where <math>a_i</math> is the action taken by the <math>i^{th}</math> player. In addition, each player has an associated utility function that is given by <math>u_i: A_1\times...\times A_n \rightarrow\mathbb{R}</math>. An example of a normal form game is the prisoner's dilemma. Zero-sum games refer to a game in which one player gains at the detriment of the other player. Mathematically, <math>\forall a\in A_1\times A_2, u_2(a) + u_2(a) = 0</math> (Larson, "Game-theoretic methods for computer science Normal Form Games"). <br />
<br />
In two-player zero-sum games there is a '''payoff matrix''' <math>P</math> that describes the rewards for two players employing specific strategies u and v respectively. The optimal strategy mixture may be found with a classic min-max formulation:<br />
$$\min_u \max_v \ u^T P v \\ subject \ to \ 1^T u =1, u \ge 0 \\ 1^T v =1, v \ge 0. \ $$<br />
<br />
Here, we consider the case where <math>P</math> is not known a priori. <math>P</math> could be a single fixed but unknown payoff<br />
matrix, or depend on some external context <math>x</math>. The solution <math> (u^*, v_0) </math> to this optimization and the solution <math> (u_0,v^*) </math> to the corresponding problem with inverse player order form the Nash equilibrium <math>(u^*,v^*) </math>. At this equilibrium, the players do not have anything to gain by changing their strategy, so this point is a stable state of the system. When the payoff matrix P is not known, we observe samples of actions <math> a^{(i)}, i =1,...,N </math> from one or both players, which depends on some external content <math> x </math>, sampled from the equilibrium strategies <math>(u^*,v^*) </math>, to recover the true underlying payoff matrix P or a function form P(x) depending on the current context.<br />
<br />
=== Quantal Response Equilibria ===<br />
<br />
However, NE is poorly suited because NEs are overly strict, discontinuous with respect to P, and may not be unique. To address these issues, model the players' actions with the '''quantal response equilibria''' (QRE), where noise is added to the payoff matric. Specifically, consider the ''logit'' equilibrium for zero-sum games that obeys the fixed point:<br />
$$<br />
u^* _i = \frac {exp(-Pv)_i}{\sum_{q \in [n]} exp (-Pv)_q}, \ v^* _j= \frac {exp(P^T u)_j}{\sum_{q \in [m]} exp (P^T u)_q} .\qquad \ (1)<br />
$$<br />
For a fixed opponent strategy, the logit equilibrium corresponding to a strategy is strictly convex, and thus the regularized best response is unique.<br />
<br />
=== End-to-End Learning ===<br />
<br />
Then to integrate a zero-sum solver, [1] introduced a method to solve the QRE and to differentiate through its solution.<br />
<br />
'''QRE solver''':<br />
To find the fixed point in (1), it is equivalent to solve the regularized min-max game:<br />
$$<br />
\min_{u \in \mathbb{R}^n} \max_{v \in \mathbb{R}^m} \ u^T P v -H(v) + H(u) \\<br />
\text{subject to } 1^T u =1, \ 1^T v =1, <br />
$$<br />
where H(y) is the Gibbs entropy <math> \sum_i y_i log y_i</math>.<br />
Entropy regularization guarantees the non-negative condition and makes the equilibrium continuous with respect to P, which means players are encouraged to play more randomly, and all actions have a non-zero probability. Moreover, this problem has a unique saddle point corresponding to <math> (u^*, v^*) </math>.<br />
<br />
Using a primal-dual Newton Method to solve the QRE for two-player zero-sum games, the KKT conditions for the problem are:<br />
$$ <br />
Pv + \log(u) + 1 +\mu 1 = 0 \\<br />
P^T v -\log(v) -1 +\nu 1 = 0 \\<br />
1^T u = 1, \ 1^T v = 1, <br />
$$<br />
where <math> (\mu, \nu) </math> are Lagrange multipliers for the equality constraints on u, v respectively. Then applying Newton's method gives the the update rule:<br />
$$<br />
Q \begin{bmatrix} \Delta u \\ \Delta v \\ \Delta \mu \\ \Delta \nu \\ \end{bmatrix} = - \begin{bmatrix} P v + \log u + 1 + \mu 1 \\ P^T u - \log v - 1 + \nu 1 \\ 1^T u - 1 \\ 1^T v - 1 \\ \end{bmatrix}, \qquad (2)<br />
$$<br />
where Q is the Hessian of the Lagrangian, given by <br />
$$ <br />
Q = \begin{bmatrix} <br />
diag(\frac{1}{u}) & P & 1 & 0 \\ <br />
P^T & -diag(\frac{1}{v}) & 0 & 1\\<br />
1^T & 0 & 0 & 0 \\<br />
0 & 1^T & 0 & 0 \\<br />
\end{bmatrix}. <br />
$$<br />
<br />
'''Differentiating Through QRE Solutions''':<br />
The QRE solver provides a method to compute the necessary Jacobian-vector products. Specifically, we compute the gradient of the loss given the solution <math> (u^*,v^*) </math> to the QRE, and some loss function <math> L(u^*,v^*) </math>: <br />
<br />
1. Take differentials of the KKT conditions: <br />
<math><br />
Q \begin{bmatrix} <br />
du & dv & d\mu & d\nu \\ <br />
\end{bmatrix} ^T = \begin{bmatrix} <br />
-dPv & -dP^Tu & 0 & 0 \\ <br />
\end{bmatrix}^T. \ <br />
</math><br />
<br />
2. For small changes du, dv, <br />
<math><br />
dL = \begin{bmatrix} <br />
v^TdP^T & u^TdP & 0 & 0 \\ <br />
\end{bmatrix} Q^{-1} \begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
3. Apply this to P, and take limits as dP is small:<br />
<math><br />
\nabla_P L = y_u v^T + u y_v^T, \qquad (3)<br />
</math> where <br />
<math><br />
\begin{bmatrix} <br />
y_u & y_v & y_{\mu} & y_{\nu}\\ <br />
\end{bmatrix}=Q^{-1}\begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
Hence, the forward pass is given by using the expression in (2) to solve for the logit equilibrium given P, and the backward pass is given by using <math> \nabla_u L </math> and <math> \nabla_v L </math> to obtain <math> \nabla_P L </math> using (3). There does not always exist a unique P which generates <math> u^*, v^* </math> under the logit QRE, and we cannot expect to recover P when under-constrained.<br />
<br />
== Learning Extensive form games ==<br />
<br />
The normal form representation for games where players have many choices quickly becomes intractable. For example, consider a chess game: One the first turn, player 1 has 20 possible moves and then player 2 has 20 possible responses. If in the following number of turns each player is estimated to have ~30 possible moves and if a typical game is 40 moves per player, the total number of strategies is roughly <math>10^{120} </math> per player (this is known as the Shannon number for game-tree complexity of chess) and so the payoff matrix for a typical game of chess must therefore have <math> O(10^{240}) </math> entries.<br />
<br />
Instead, it is much more useful to represent the game graphically as an "''' Extensive form game'''" (EFG). We'll also need to consider types of games where there is '''imperfect information''' - players do not necessarily have access to the full state of the game. An example of this is one-card poker: (1) Each player draws a single card from a 13-card deck (ignore suits) (2) Player 1 decides whether to bet/hold (3) Player 2 decides whether to call/raise (4) Player 1 must either call/fold if Player 2 raised. From this description, player 1 has <math> 2^{13} </math> possible first moves (all combinations of (card, raise/hold)) and has <math> 2^{13} </math> possible second moves (whenever player 1 gets a second move) for a total of <math> 2^{26} </math> possible strategies. In addition, Player 1 never knows what cards player 2 has and vice versa. So instead of representing the game with a huge payoff matrix, we can instead represent it as a simple decision tree (for a ''single'' drawn the card of player 1):<br />
<br />
<br />
<center> [[File:1cardpoker.PNG]] </center><br />
<br />
where player 1 is represented by "1", a node that has two branches corresponding to the allowed moves of player 1. However there must also be a notion of information available to either player: While this tree might correspond to say, player 1 holding a "9", it contains no information on what card player 2 is holding (and is much simpler because of this). This leads to the definition of an '''information set''': the set of all nodes belonging to a single player for which the other player cannot distinguish which node has been reached. The information set may therefore be treated as a node itself, for which actions stemming from the node must be chosen in ignorance of what the other player did immediately before arriving at the node. In the poker example, the full game tree consists of a much more complex version of the tree shown above (containing repetitions of the given tree for every possible combination of cards dealt) and the and an example of the information set for player 1 is the set of all of the nodes owned by player 2 that immediately follow player 1's decision to hold. In other words, if player 1 holds there are 13 possible nodes describing the responses of player 2 (raise/hold for player 2 having card = ace, 1, ... King), and all 13 of these nodes are indistinguishable to player 1, and so form the information set for player 1.<br />
<br />
The following is a review of important concepts for extensive form games first formalized in [2]. Let <math> \mathcal{I}_i </math> be the set of all information sets for player i, and for each <math> t \in \mathcal{I}_i </math> let <math> \sigma_t </math> be the actions taken by player i to arrive at <math> t </math> and <math> C_t </math> be the actions that player i can take from <math> u </math>. Then the set of all possible sequences that can be taken by player i is given by<br />
<br />
$$<br />
S_i = \{\emptyset \} \cup \{ \sigma_t c | u\in \mathcal{I}_i, c \in C_t \}<br />
$$<br />
<br />
So for the one-card poker we would have <math>S_1 = \{\emptyset, \text{raise}, \text{hold}, \text{hold-call}, \text{hold-fold\} }</math>. From the possible sequences follows two important concepts:<br />
<ol><br />
<li>The EFG '''payoff matrix''' <math> P </math> of size <math>|S_1| \times |S_2| </math> (this is all possible actions available to either player), is populated with rewards from each leaf of the tree (or "zero" for each <math> (s_1, s_2) </math> that is an invalid pair), and the expected payoff for realization plans <math> (u, v) </math> is given by <math> u^T P v </math> </li><br />
<li> A '''realization plan''' <math> u \in \mathbb{R}^{|S_1|} </math> for player 1 (<math> v \in \mathbb{R}^{|S_2|} </math> for player 2 ) will describe probabilities for players to carry out each possible sequence, and each realization plan must be constrained by (i) compatibility of sequences (e.g. "raise" is not compatible with "hold-call") and (ii) information sets available to the player. These constraints are linear:<br />
<br />
$$<br />
Eu = e \\<br />
Fv = f<br />
$$<br />
<br />
where <math> e = f = (1, 0, ..., 0)^T </math> and <math> E, F</math> contain entries in <math> {-1, 0, 1} </math> describing compatibility and information sets. </li><br />
<br />
</ol> <br />
<br />
<br />
The paper's main contribution is to develop a minimax problem for extensive form games:<br />
<br />
<br />
$$<br />
\min_u \max_v u^T P v + \sum_{t\in \mathcal{I}_1} \sum_{c \in C_t} u_c \log \frac{u_c}{u_{p_t}} - \sum_{t\in \mathcal{I}_2} \sum_{c \in C_t} v_c \log \frac{v_c}{v_{p_t}}<br />
$$<br />
<br />
where <math> p_t </math> is the action immediately preceding information set <math> t </math>. Intuitively, each sum resembles a cross-entropy over the distribution of probabilities in the realization plan comparing each probability to proceed from the information set to the probability to arrive at that information set. Importantly, these entropies are strictly convex or concave (for player 1 and player 2 respectively) [3] so that the min-max problem will have a unique solution and ''the objective function is continuous and continuously differentiable'' - this means there is a way to optimize the function. As noted in Theorem 1 of [1], the solution to this problem is equivalently a solution for the QRE of the game in reduced normal form.<br />
<br />
Minimax can also be seen from an algorithmic perspective. Referring to the above figure containing a tree, it contains a sequence of states and action which alternates between two or more competing players. The above formulation of the min-max problem essentially measures how well a decision rule is from the perspective of a single player. To describe it in terms of the tree, if it is player 1's turn, then it is a mutual recursion of player 1 choosing to maximize its payoff and player 2 choosing to minimize player 1's payoff.<br />
<br />
Having decided on a cost function, the method of Lagrange multipliers may be used to construct the Lagrangian that encodes the known constraints (<math> Eu = e \,, Fv = f </math>, and <math> u, v \geq 0</math>), and then optimize the Lagrangian using Newton's method (identically to the previous section). Accounting for the constraints, the Lagrangian becomes <br />
<br />
<br />
$$<br />
\mathcal{L} = g(u, v) + \sum_i \mu_i(Eu - e)_i + \sum_i \nu_i (Fv - f)_i<br />
$$<br />
<br />
where <math>g</math> is the argument from the minimax statement above and <math>u, v \geq 0</math> become KKT conditions. The general update rule for Newton's method may be written in terms of the derivatives of <math> \mathcal{L} </math> with respect to primal variables <math>u, v </math> and dual variables <math> \mu, \nu</math>, yielding:<br />
<br />
$$<br />
\nabla_{u,v,\mu,\nu}^2 \mathcal{L} \cdot (\Delta u, \Delta v, \Delta \mu, \Delta \nu)^T= - \nabla_{u,v,\mu,\nu} \mathcal{L}<br />
$$<br />
where <math>\nabla_{u,v,\mu,\nu}^2 \mathcal{L} </math> is the Hessian of the Lagrangian and <math>\nabla_{u,v,\mu,\nu} \mathcal{L} </math> is simply a column vector of the KKT stationarity conditions. Combined with the previous section, this completes the goal of the paper: To construct a differentiable problem for learning normal form and extensive form games.<br />
<br />
== Experiments ==<br />
<br />
The authors demonstrated learning on extensive form games in the presence of ''side information'', with ''partial observations'' using three experiments. In all cases, the goal was to maximize the likelihood of realizing an observed sequence from the player, assuming they act in accordance with the QRE. The authors found that the best way to implement the module was to use a medium to large batch size, RMSProp, or Adam optimizers with a learning rate between <math>\left[0.0001,0.01\right].</math> <br />
<br />
=== Rock, Paper, Scissors ===<br />
<br />
Rock, Paper, Scissors is a 2-player zero-sum game. For this game, the best strategy to reach a Nash equilibrium and a Quantal response equilibrium is to uniformly play each hand with equal odds.<br />
The first experiment was to learn a non-symmetric variant of Rock, Paper, Scissors with ''incomplete information'' with the following payoff matrix:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix of modified Rock-Paper-Scissors''<br />
! <br />
! ''Rock''<br />
! ''Paper''<br />
! ''Scissors''<br />
|-<br />
! ''Rock''<br />
| '''''0'''''<br />
| <math>-b_1</math><br />
| <math>b_2</math><br />
|-<br />
! ''Paper''<br />
| <math>b_1</math><br />
| '''''0'''''<br />
| <math>-b_3</math><br />
|-<br />
! ''Scissors''<br />
| <math>-b_2</math><br />
| <math>b_3</math><br />
| '''''0'''''<br />
|}<br />
<br />
where each of the <math> b </math> ’s are linear function of some features <math> x \in \mathbb{R}^{2} </math> (i.e., <math> b_y = x^Tw_y </math>, <math> y \in </math> {<math>1,2,3</math>} , where <math> w_y </math> are to be learned by the algorithm). Using many trials of random rewards the technique produced the following results for optimal strategies[1]: <br />
<br />
[[File:RPS Results.png|500px ]]<br />
<br />
From the graphs above, we can tell the following: 1) both parameters learned and predicted strategies improve with a larger dataset; 2) with a reasonably sized dataset, >1000 here, convergence is stable and is fairly quick.<br />
<br />
=== One-Card Poker ===<br />
<br />
Next they investigated extensive form games using the one-Card Poker (with ''imperfect information'') introduced in the previous section. In the experimental setup, they used a deck stacked non-uniformly (meaning repeat cards were allowed). Their goal was to learn this distribution of cards from observations of many rounds of the play. Different from the distribution of cards dealt, the method built in the paper is more suited to learn the player’s perceived or believed distribution of cards. It may even be a function of contextual features such as demographics of players. Three experiments were run with <math> n=4 </math>. Each experiment comprised 5 runs of training, with same weights but different training sets. Let <math> d \in \mathbb{R}^{n}, d \ge 0, \sum_{i} d_i = 1 </math> be the weights of the cards. The probability that the players are dealt cards <math> (i,j) </math> is <math> \frac{d_i d_j}{1-d_i} </math>. This distribution is asymmetric between players. Matrix <math> P, E, F </math> for the case <math> n=4 </math> are presented in [1]. With training for 2500 epochs, the mean squared error of learned parameters (card weights, <math> u, v </math> ) are averaged over all runs of and are presented as following [1]: <br />
<br />
<br />
[[File:One-card_Poker_Results.png|500px ]]<br />
<br />
=== Security Resource Allocation Game ===<br />
<br />
<br />
From Security Resource Allocation Game, they demonstrated the ability to learn from ''imperfect observations''. The defender possesses <math> k </math> indistinguishable and indivisible defensive resources which he splits among <math> n </math> targets, { <math> T_1, ……, T_n </math>}. The attacker chooses one target. If the attack succeeds, the attacker gets <math> R_i </math> reward and defender gets <math> -R_i </math>, otherwise zero payoff for both. If there are n defenders guarding <math> T_i </math>, probability of successful attack on <math> T_i </math> is <math> \frac{1}{2^n} </math>. The expected payoff matrix when <math> n = 2, k = 3 </math>, where the attackers are the row players is:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix when <math> n = 2, k = 3 </math>''<br />
! {#<math>D_1</math>,#<math>D_2</math>}<br />
! {0, 3}<br />
! {1, 2}<br />
! {2, 1}<br />
! {3, 0}<br />
|-<br />
! <math>T_1</math><br />
| <math>-R_1</math><br />
| <math>-\frac{1}{2}R_1</math><br />
| <math>-\frac{1}{4}R_1</math><br />
| <math>-\frac{1}{8}R_1</math><br />
|-<br />
! <math>T_2</math><br />
| <math>-\frac{1}{8}R_2</math><br />
| <math>-\frac{1}{4}R_2</math><br />
| <math>-\frac{1}{2}R_2</math><br />
| <math>-R_2</math><br />
|} <br />
<br />
<br />
For a multi-stage game the attacker can launch <math> t </math> attacks, one in each stage while defender can only stick with stage 1. The attacker may change target if the attack in stage 1 is failed. Three experiments are run with <math> n = 2, k = 5 </math> for games with single attack and double attack, i.e, <math> t = 1 </math> and <math> t = 2 </math>. The results of simulated experiments are shown below [1]:<br />
<br />
[[File:Security Game Results.png|500px ]]<br />
<br />
<br />
They learned <math> R_i </math> only based on observations of the defender’s actions and could still recover the game set by only observing the defender’s actions. Same as expectation, the larger dataset size improves the learned parameters. Two outliers are 1) Security Game, the green plot for when <math> t = 2 </math>; and 2) RPS, when comparing between training sizes of 2000 and 5000.<br />
<br />
== Conclusion ==<br />
Unsurprisingly, the results of this study show that in general, the quality of learned parameters improved as the number of observations increased. The network presented in this paper demonstrated improvement over the existing methodology. <br />
<br />
This paper presents an end-to-end framework for implementing a game solver, for both extensive and normal form, as a module in a deep neural network for zero-sum games. This method, unlike many previous works in this area, does not require the parameters of the game to be known to the agent prior to the start of the game. The two-part method analytically computes both the optimal solution and the parameters of the game. Future work involves taking advantage of the KKT matrix structure to increase computation speed and extensions to the area of learning general-sum games.<br />
<br />
== Critiques ==<br />
The proposed method appears to suffer from two flaws. Firstly, the assumption that players behave in accordance with the QRE severely limits the space of player strategies and is known to exhibit pathological behavior even in one-player settings. Second, the solvers are computationally inefficient and are unable to scale. For this setting, they used Newton's method and it is found that the second-order algorithms do not scale to large games as with Nash equilibrium [4]. <br />
<br />
In the one-card poker section, it might be better to write "Next, they investigated extensive form games ... It may even be a function of contextual features such as the demographics of players. ... 5 runs of training, with the same weights but different training sets."<br />
<br />
Zero-Sum Normal Form Games usually follow Gaussian distributions with two non-empty sets of strategies of player one and player two correspondingly, and also a payoff function defined on the set of possible realizations.<br />
<br />
This method of proposing that real players will follow a laid out scheme or playing strategy is almost always a drastic oversimplification of real-world scenarios and severely limits the applications. In order to better fit a model to the real players, there almost always has to be some sort of stochastic implementation which accounts for the presence of irrational users in the system. <br />
<br />
It might be a good idea to discuss the algorithm performance based on more complicated player reactions, for example, player 2 or NPC from the game might react based on player 1's action. If player 2 would react with a "clever" choice, we might be able to shrink the choice space, which might be helpful to accelerate the training time.<br />
<br />
The theory assumes rational players, which means that roughly speaking, the players make decisions based on increasing their respective payoffs (utility values, preferences,..). However, in real-life scenarios, this might not hold true. Thus, it would be a good idea to include considerations regarding this issue in this summary. Another area that is worth exploring is the need for things like “bluffing” in games with hidden information. It would be interesting to see whether the results generated in this paper are in support of "bluffing" or not.<br />
<br />
It is good that the authors of this paper provided analysis on different types of games, but it would be great if they could also provide some future insights on this method such as whether it can be applied to more complex games such as blackjack poker / it would work or not.<br />
<br />
The authors evaluated the performance by plotting the mean squared error (MSE) of the training data. Prediction error on testing data can also be provided to better evaluate the prediction power.<br />
<br />
== References ==<br />
<br />
[1] Ling, C. K., Fang, F., & Kolter, J. Z. (2018). What game are we playing? end-to-end learning in normal and extensive form games. arXiv preprint arXiv:1805.02777.<br />
<br />
[2] B. von Stengel. Efficient computation of behavior strategies.Games and Economics Behavior,14(0050):220–246, 1996.<br />
<br />
[3] Boyd, S., Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge university press.<br />
<br />
[4] Farina, G., Kroer, C., & Sandholm, T. (2019). Online Convex Optimization for Sequential Decision Processes and Extensive-Form Games. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 1917-1925. https://doi.org/10.1609/aaai.v33i01.33011917<br />
<br />
[5] Larson, K. Game-theoretic methods for computer science Normal Form Games. Lecture presented at CS 886 in University of Waterloo, Waterloo. Retrieved from https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwiL24iwprrtAhWJFFkFHU52D0MQFjAQegQIIhAC&url=https://cs.uwaterloo.ca/~klarson/teaching/F06-886/slides/normalform.pdf&usg=AOvVaw0j9fw9oRkUGZOsY8ZCyv6i</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan&diff=49712User:Yktan2020-12-07T03:52:21Z<p>X277lu: /* Critiques/ Insights */</p>
<hr />
<div><br />
== Introduction ==<br />
<br />
Much of the success in training deep neural networks (DNNs) is due to the collection of large datasets with human-annotated labels. However, human annotation is both a time-consuming and expensive task, especially for data that requires expertise such as medical data. Furthermore, certain datasets can be noisy due to the biases introduced by different annotators. Data obtained in large quantities through searching for images in search engines and data downloaded from social media sites (in a manner abiding by privacy and copyright laws) are especially noisy, since the labels are generally inferred from tags to save on human-annotation cost. <br />
<br />
There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. Other LNL methods estimate a noise transition matrix and employ it to correct the loss function. An example of a popular loss correction approach is the bootstrapping loss approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples. The main limitation of these methods is that they do not perform well under high noise ratio and cause overfitting.<br />
<br />
This paper introduces DivideMix, which combines approaches from LNL and SSL. One unique thing about DivideMix is that it discards sample labels that are highly likely to be noisy and leverages these noisy samples as unlabeled data instead. This prevents the model from overfitting and improves generalization performance. Key contributions of this work are:<br />
1) Co-divide, which trains two networks simultaneously, aims to improve generalization and avoid confirmation bias.<br />
2) During the SSL phase, an improvement is made on an existing method (MixMatch) by combining it with another method (MixUp).<br />
3) Significant improvements to state-of-the-art results on multiple conditions are experimentally shown while using DivideMix. Extensive ablation study and qualitative results are also shown to examine the effect of different components.<br />
<br />
== Motivation ==<br />
<br />
While much has been achieved in training DNNs with noisy labels and SSL methods individually, not much progress has been made in exploring their underlying connections and building on top of the two approaches simultaneously. <br />
<br />
Existing LNL methods aim to correct the loss function by:<br />
<ol><br />
<li> Treating all samples equally and correcting loss explicitly or implicitly through relabelling of the noisy samples<br />
<li> Reweighting training samples or separating clean and noisy samples, which results in correction of the loss function<br />
</ol><br />
<br />
A few examples of LNL methods include:<br />
<ol><br />
<li> Estimating the noise transition matrix, which denotes the probability of clean labels flipping to noisy labels, to correct the loss function<br />
<li> Leveraging the predictions from DNNs to correct labels and using them to modify the loss<br />
<li> Reweighting samples so that noisy labels contribute less to the loss<br />
</ol><br />
<br />
However, these methods all have downsides: it is very challenging to correctly estimate the noise transition matrix in the first method; for the second method, DNNs tend to overfit to datasets with high noise ratio; and for the third method, we need to be able to identify clean samples, which has also proven to be challenging.<br />
<br />
On the other hand, SSL methods mostly leverage unlabeled data using regularization to improve model performance. A recently proposed method, MixMatch, incorporates the two classes of regularization. These classes are consistency regularization which enforces the model to produce consistent predictions on augmented input data, and entropy minimization which encourages the model to give high-confidence predictions on unlabeled data, as well as MixUp regularization. <br />
<br />
DivideMix partially adopts LNL in that it removes the labels that are highly likely to be noisy by using co-divide to avoid the confirmation bias problem. It then utilizes the noisy samples as unlabeled data and adopts an improved version of MixMatch (an SSL technique) which accounts for the label noise during the label co-refinement and co-guessing phase. By incorporating SSL techniques into LNL and taking the best of both worlds, DivideMix aims to produce highly promising results in training DNNs by better addressing the confirmation bias problem, more accurately distinguishing and utilizing noisy samples, and performing well under high levels of noise.<br />
<br />
== Model Architecture and Algorithm ==<br />
<br />
DivideMix leverages semi-supervised learning to achieve effective modeling. The sample is first split into a labeled set and an unlabeled set. This is achieved by fitting a Gaussian Mixture Model as a per-sample loss distribution. The unlabeled set is made up of data points with discarded labels deemed noisy. <br />
<br />
Then, to avoid confirmation bias, which is typical when a model is self-training, two models are being trained simultaneously to filter error for each other. This is done by dividing the data into clean labeled set (X)and a noisy unlabeled set (U) using one model and then training the other model on the data set U. This algorithm, known as Co-divide, keeps the two networks from converging when training, which avoids the bias from occurring. Gaussian Mixture Model is better at distinguishing X and U, whereas Beta Mixture Model produces flat distribution and fails to label correctly. Being diverged also offers the two networks distinct abilities to filter different types of error, making the model more robust to noise. However, the model could still have confirmation error where both model would prone to make and confirm the same mistake. Figure 1 describes the algorithm in graphical form.<br />
<br />
[[File:ModelArchitecture.PNG | center]]<br />
<br />
<div align="center">Figure 1: Model Architecture of DivideMix</div><br />
<br />
For each epoch, the network divides the dataset into a labeled set consisting of clean data, and an unlabeled set consisting of noisy data, which is then used as training data for the other network, where training is done in mini-batches. For each batch of the labelled samples, co-refinement is performed by using the ground truth label <math> y_b </math>, the predicted label <math> p_b </math>, and the posterior is used as the weight, <math> w_b </math>. <br />
<br />
<center><math> \bar{y}_b = w_b y_b + (1-w_b) p_b </math></center> <br />
<br />
Then, a sharpening function is implemented on this weighted sum to produce the estimate with reduced temperature, <math> \hat{y}_b </math>. <br />
<br />
<center><math> \hat{y}_b=Sharpen(\bar{y}_b,T)={\bar{y}^{c{\frac{1}{T}}}_b}/{\sum_{c=1}^C\bar{y}^{c{\frac{1}{T}}}_b} </math>, for <math>c = 1, 2,..,C</math></center><br />
<br />
Using all these predicted labels, the unlabeled samples will then be assigned a "co-guessed" label, which should produce a more accurate prediction. Having calculated all these labels, MixMatch is applied to the combined mini-batch of labeled, <math> \hat{X} </math> and unlabeled data, <math> \hat{U} </math>, where, for a pair of samples and their labels, one new sample and new label is produced. More specifically, for a pair of samples <math> (x_1,x_2) </math> and their labels <math> (p_1,p_2) </math>, the mixed sample <math> (x',p') </math> is:<br />
<br />
<center><br />
<math><br />
\begin{alignat}{2}<br />
<br />
\lambda &\sim Beta(\alpha, \alpha) \\<br />
\lambda ' &= max(\lambda, 1 - \lambda) \\<br />
x' &= \lambda ' x_1 + (1 - \lambda ' ) x_2 \\<br />
p' &= \lambda ' p_1 + (1 - \lambda ' ) p_2 \\<br />
<br />
\end{alignat}<br />
</math><br />
</center> <br />
<br />
MixMatch transforms <math> \hat{X} </math> and <math> \hat{U} </math> into <math> X' </math> and <math> U' </math>. Then, the loss on <math> X' </math>, <math> L_X </math> (Cross-entropy loss) and the loss on <math> U' </math>, <math> L_U </math> (Mean Squared Error) are calculated. A regularization term, <math> L_{reg} </math>, is introduced to regularize the model's average output across all samples in the mini-batch. Then, the total loss is calculated as:<br />
<br />
<center><math> L = L_X + \lambda_u L_U + \lambda_r L_{reg} </math></center> <br />
<br />
where <math> \lambda_r </math> is set to 1, and <math> \lambda_u </math> is used to control the unsupervised loss.<br />
<br />
Lastly, the stochastic gradient descent formula is updated with the calculated loss, <math> L </math>, and the estimated parameters, <math> \boldsymbol{ \theta } </math>.<br />
<br />
The full algorithm is shown below. [[File:dividemix.jpg|600px| | center]]<br />
<div align="center">Algorithm1: DivideMix. Line 4-8: co-divide; Line 17-18: label co-refinement; Line 20: co-guessing.</div><br />
<br />
Then, when the model is warmed up, it is trained on all data using standard cross-entropy to initially converge the model, but with a regulatory negative entropy term <math>\mathcal{H} = -\sum_{c}\text{p}^\text{c}_\text{model}(x;\theta)\log(\text{p}^\text{c}_\text{model}(x;\theta))</math>, where <math>\text{p}^\text{c}_\text{model}</math> is the softmax output probability for class c. This term penalizes confident predictions during the warm up to prevent overfitting to noise during the warm up, which can happen when there is asymmetric noise.<br />
<br />
== Results ==<br />
'''Applications'''<br />
<br />
The method was validated using four benchmark datasets: CIFAR-10, CIFAR100 (Krizhevsky & Hinton, 2009) which contain 50K training images and 10K test images of size 32 × 32), Clothing1M (Xiao et al., 2015), and WebVision (Li et al., 2017a).<br />
Two types of label noise are used in the experiments: symmetric and asymmetric.<br />
An 18-layer PreAct Resnet (He et al., 2016) is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 300 epochs. The initial learning rate was set to 0.02 and reduced by a factor of 10 after 150 epochs. Before applying the Co-divide and MixMatch strategies, the models were first independently trained over the entire dataset using cross-entropy loss during a "warm-up" period. Initially, training the models in this way prepares a more regular distribution of losses to improve upon in subsequent epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100. For all CIFAR experiments, we use the same hyperparameters M = 2, T = 0.5, and α = 4. τ is set as 0.5 except for 90% noise ratio when it is set as 0.6.<br />
<br />
<br />
'''Comparison of State-of-the-Art Methods'''<br />
<br />
The effectiveness of DivideMix was shown by comparing the test accuracy with the most recent state-of-the-art methods: <br />
Meta-Learning (Li et al., 2019) proposes a gradient-based method to find model parameters that are more noise-tolerant; <br />
Joint-Optim (Tanaka et al., 2018) and P-correction (Yi & Wu, 2019) jointly optimize the sample labels and the network parameters;<br />
M-correction (Arazo et al., 2019) models sample loss with BMM and apply MixUp.<br />
The following are the results on CIFAR-10 and CIFAR-100 with different levels of symmetric label noise ranging from 20% to 90%. Both the best test accuracy across all epochs and the averaged test accuracy over the last 10 epochs were recorded in the following table:<br />
<br />
<br />
[[File:divideMixtable1.PNG | center]]<br />
<br />
From table 1, the author noticed that none of these methods can consistently outperform others across different datasets. M-correction excels at symmetric noise, whereas Meta-Learning performs better for asymmetric noise. DivideMix outperforms state-of-the-art methods by a large margin across all noise ratios. The improvement is substantial (∼10% of accuracy) for the more challenging CIFAR-100 with high noise ratios.<br />
<br />
DivideMix was compared with the state-of-the-art methods with the other two datasets: Clothing1M and WebVision. It also shows that DivideMix consistently outperforms state-of-the-art methods across all datasets with different types of label noise. For WebVision, DivideMix achieves more than 12% improvement in top-1 accuracy. <br />
<br />
<br />
'''Ablation Study'''<br />
<br />
The effect of removing different components to provide insights into what makes DivideMix successful. We analyze the results in Table 5 as follows.<br />
<br />
<br />
[[File:DivideMixtable5.PNG | center]]<br />
<br />
The authors combined self-divide with the original MixMatch as a naive baseline for using SLL in LNL.<br />
They also find that both label refinement and input augmentation are beneficial for DivideMix. ''Label refinement'' is important for high noise ratio due because samples that are noisier would be incorrectly divided into the labeled set. ''Augmentation'' upgrades model performance by creating more reliable predictions and by achieving consistent regularization. In addition, the performance drop was seen in the ''DivideMix w/o co-training'' highlights the disadvantage of self-training; the model still has dataset division, label refinement and label guessing, but they are all performed by the same model.<br />
<br />
== Conclusion ==<br />
<br />
This paper provides a new and effective algorithm for learning with noisy labels by using highly noisy data unlabelled data in a Semi-Supervised Learning framework. The DivideMix method trains two networks simultaneously and utilizes co-guessing and co-labeling effectively, therefore it is a robust approach to deal with noise in datasets. Also, the DivideMix method has been tested using various datasets with the results consistently being one of the best when compared to the state-of-the-art methods through extensive experiments.<br />
<br />
Future work of DivideMix is to create an adaptation for other applications such as Natural Language Processing, and incorporating the ideas of SSL and LNL into DivideMix architecture.<br />
<br />
== Critiques/ Insights ==<br />
<br />
1. While combining both models makes the result better, the author did not show the relative time increase using this new combined methodology, which is very crucial considering training a large amount of data, especially for images. In addition, it seems that the author did not perform much on hyperparameters tuning for the combined model.<br />
<br />
2. There is an interesting insight, which is when the noise ratio increases from 80% to 90%, the accuracy of DivideMix drops dramatically in both datasets.<br />
<br />
3. There should be a further explanation of why the learning rate drops by a factor of 10 after 150 epochs.<br />
<br />
4. It would be interesting to see the effectiveness of this method in other domains such as NLP. I am not aware of noisy training datasets available in NLP, but surely this is an important area to focus on, as much of the available data is collected from noisy sources from the web.<br />
<br />
5. The paper implicitly assumes that a Gaussian mixture model (GMM) is sufficiently capable of identifying noise. Given the nature of a GMM, it would work well for noise that is distributed by a Gaussian distribution but for all other noise, it would probably be only asymptotic. The paper should present theoretical results on the noise that are Exponential, Rayleigh, etc. This is particularly important because the experiments were done on massive datasets, but they do not directly address the case when there are not many data points. <br />
<br />
6. Comparing the training result on these benchmark datasets makes the algorithm quite comprehensive. This is a very insightful idea to maintain two networks to avoid bias from occurring.<br />
<br />
7. The current benchmark accuracy for CIFAR-10 is 99.7, CIFAR-100 is 96.08 using EffNet-L2 in 2020. In 2019, CIFAR-10 is 99.37, CIFAR-100 is 93.51 using BiT-L.(based on paperswithcode.com) As there exists better methods, it would be nice to know why the authors chose these state-of-the-art methods to compare the test accuracy.<br />
<br />
8. Another interesting observation is that DivideMix seems to maintain a similar accuracy while some methods give unstable results. That shows the reliability of the proposed algorithm.<br />
<br />
9. It would be interesting to see if the drop in accuracy from increasing the noise ratio to 90% is a result of a low porportion or low number of clean labels. That is, would increasing the size of the training set but keeping the noise ratio at 90% result in increased accuracy?<br />
<br />
10. For Ablation Study part, the paper also introduced a study on the Robustness of Testing Marking Methods Noise, including AUC for classification of clean/noisy samples of CIFAR-10 training data. And it shows that the method can effectively separate clean and noisy samples as training proceeds.<br />
<br />
11. It is interesting how unlike common methods, the method in this paper discards the labels that are highly likely to be<br />
noisy. It also utilizes the noisy samples as unlabeled data to regularize training in a SSL manner. This model can better distinguish and utilize noisy samples.<br />
<br />
12. In the result section, the author gives us a comprehensive understanding of this algorithm by introducing the applications and the comparison of it with respect to similar methods. It would be attractive if in the application part, the author could indicate how the application relative to our daily life.<br />
<br />
13. High quality data is very important for training Machine learning systems. Preparing the data to train ML systems requires data annotations which are prone to errors and are time-consuming. It is interesting to note how paper 14 and this paper aims to approach this problem from different perspectives. Paper 14 introduces CSL algorithm that learns from confused or Noisy data to find the tasks associated with them. And this paper proposes an algorithm that shows good performance when learning from noisy data. Hence both the papers seem to tackle similar problem and implementing the approaches described in both the papers when handling noisy data can be twice helpful.<br />
<br />
14. Noise exists in all big data, and big data is what we are dealing with in real life nowadays. Having an effective noise eliminating method such as Dividemix is important to us.<br />
<br />
15. The DivideMix consistently outperforms state-of-the-art methods across the given datasets, but how about some other potential datasets? If it can be given that it has advantages for a certain type of potential dataset, it will be a better discussion.<br />
<br />
16. It would be better if there was more information regarding four benchmark datasets (CIFAR-10, CIFAR100, Clothing1M, and WebVision) so that readers could be aware of the properties or differences of those datasets.<br />
<br />
17. It would be interesting to compare the computational cost of this model versus others. Would it be more computationally faster to use in real world applications?<br />
<br />
18. It would be more informative if test error curve (test error vs number of epochs) for different methods can also be provided, as the readers can visualize the warm up period and stabilized period.<br />
<br />
== References ==<br />
[1] Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Unsupervised<br />
label noise modeling and loss correction. In ICML, pp. 312–321, 2019.<br />
<br />
[2] David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin<br />
Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019.<br />
<br />
[3] Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach<br />
to learning from noisy labels. In WACV, pp. 1215–1224, 2018.</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan&diff=49708User:Yktan2020-12-07T03:47:29Z<p>X277lu: /* Introduction */</p>
<hr />
<div><br />
== Introduction ==<br />
<br />
Much of the success in training deep neural networks (DNNs) is due to the collection of large datasets with human-annotated labels. However, human annotation is both a time-consuming and expensive task, especially for data that requires expertise such as medical data. Furthermore, certain datasets can be noisy due to the biases introduced by different annotators. Data obtained in large quantities through searching for images in search engines and data downloaded from social media sites (in a manner abiding by privacy and copyright laws) are especially noisy, since the labels are generally inferred from tags to save on human-annotation cost. <br />
<br />
There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. Other LNL methods estimate a noise transition matrix and employ it to correct the loss function. An example of a popular loss correction approach is the bootstrapping loss approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples. The main limitation of these methods is that they do not perform well under high noise ratio and cause overfitting.<br />
<br />
This paper introduces DivideMix, which combines approaches from LNL and SSL. One unique thing about DivideMix is that it discards sample labels that are highly likely to be noisy and leverages these noisy samples as unlabeled data instead. This prevents the model from overfitting and improves generalization performance. Key contributions of this work are:<br />
1) Co-divide, which trains two networks simultaneously, aims to improve generalization and avoid confirmation bias.<br />
2) During the SSL phase, an improvement is made on an existing method (MixMatch) by combining it with another method (MixUp).<br />
3) Significant improvements to state-of-the-art results on multiple conditions are experimentally shown while using DivideMix. Extensive ablation study and qualitative results are also shown to examine the effect of different components.<br />
<br />
== Motivation ==<br />
<br />
While much has been achieved in training DNNs with noisy labels and SSL methods individually, not much progress has been made in exploring their underlying connections and building on top of the two approaches simultaneously. <br />
<br />
Existing LNL methods aim to correct the loss function by:<br />
<ol><br />
<li> Treating all samples equally and correcting loss explicitly or implicitly through relabelling of the noisy samples<br />
<li> Reweighting training samples or separating clean and noisy samples, which results in correction of the loss function<br />
</ol><br />
<br />
A few examples of LNL methods include:<br />
<ol><br />
<li> Estimating the noise transition matrix, which denotes the probability of clean labels flipping to noisy labels, to correct the loss function<br />
<li> Leveraging the predictions from DNNs to correct labels and using them to modify the loss<br />
<li> Reweighting samples so that noisy labels contribute less to the loss<br />
</ol><br />
<br />
However, these methods all have downsides: it is very challenging to correctly estimate the noise transition matrix in the first method; for the second method, DNNs tend to overfit to datasets with high noise ratio; and for the third method, we need to be able to identify clean samples, which has also proven to be challenging.<br />
<br />
On the other hand, SSL methods mostly leverage unlabeled data using regularization to improve model performance. A recently proposed method, MixMatch, incorporates the two classes of regularization. These classes are consistency regularization which enforces the model to produce consistent predictions on augmented input data, and entropy minimization which encourages the model to give high-confidence predictions on unlabeled data, as well as MixUp regularization. <br />
<br />
DivideMix partially adopts LNL in that it removes the labels that are highly likely to be noisy by using co-divide to avoid the confirmation bias problem. It then utilizes the noisy samples as unlabeled data and adopts an improved version of MixMatch (an SSL technique) which accounts for the label noise during the label co-refinement and co-guessing phase. By incorporating SSL techniques into LNL and taking the best of both worlds, DivideMix aims to produce highly promising results in training DNNs by better addressing the confirmation bias problem, more accurately distinguishing and utilizing noisy samples, and performing well under high levels of noise.<br />
<br />
== Model Architecture and Algorithm ==<br />
<br />
DivideMix leverages semi-supervised learning to achieve effective modeling. The sample is first split into a labeled set and an unlabeled set. This is achieved by fitting a Gaussian Mixture Model as a per-sample loss distribution. The unlabeled set is made up of data points with discarded labels deemed noisy. <br />
<br />
Then, to avoid confirmation bias, which is typical when a model is self-training, two models are being trained simultaneously to filter error for each other. This is done by dividing the data into clean labeled set (X)and a noisy unlabeled set (U) using one model and then training the other model on the data set U. This algorithm, known as Co-divide, keeps the two networks from converging when training, which avoids the bias from occurring. Gaussian Mixture Model is better at distinguishing X and U, whereas Beta Mixture Model produces flat distribution and fails to label correctly. Being diverged also offers the two networks distinct abilities to filter different types of error, making the model more robust to noise. However, the model could still have confirmation error where both model would prone to make and confirm the same mistake. Figure 1 describes the algorithm in graphical form.<br />
<br />
[[File:ModelArchitecture.PNG | center]]<br />
<br />
<div align="center">Figure 1: Model Architecture of DivideMix</div><br />
<br />
For each epoch, the network divides the dataset into a labeled set consisting of clean data, and an unlabeled set consisting of noisy data, which is then used as training data for the other network, where training is done in mini-batches. For each batch of the labelled samples, co-refinement is performed by using the ground truth label <math> y_b </math>, the predicted label <math> p_b </math>, and the posterior is used as the weight, <math> w_b </math>. <br />
<br />
<center><math> \bar{y}_b = w_b y_b + (1-w_b) p_b </math></center> <br />
<br />
Then, a sharpening function is implemented on this weighted sum to produce the estimate with reduced temperature, <math> \hat{y}_b </math>. <br />
<br />
<center><math> \hat{y}_b=Sharpen(\bar{y}_b,T)={\bar{y}^{c{\frac{1}{T}}}_b}/{\sum_{c=1}^C\bar{y}^{c{\frac{1}{T}}}_b} </math>, for <math>c = 1, 2,..,C</math></center><br />
<br />
Using all these predicted labels, the unlabeled samples will then be assigned a "co-guessed" label, which should produce a more accurate prediction. Having calculated all these labels, MixMatch is applied to the combined mini-batch of labeled, <math> \hat{X} </math> and unlabeled data, <math> \hat{U} </math>, where, for a pair of samples and their labels, one new sample and new label is produced. More specifically, for a pair of samples <math> (x_1,x_2) </math> and their labels <math> (p_1,p_2) </math>, the mixed sample <math> (x',p') </math> is:<br />
<br />
<center><br />
<math><br />
\begin{alignat}{2}<br />
<br />
\lambda &\sim Beta(\alpha, \alpha) \\<br />
\lambda ' &= max(\lambda, 1 - \lambda) \\<br />
x' &= \lambda ' x_1 + (1 - \lambda ' ) x_2 \\<br />
p' &= \lambda ' p_1 + (1 - \lambda ' ) p_2 \\<br />
<br />
\end{alignat}<br />
</math><br />
</center> <br />
<br />
MixMatch transforms <math> \hat{X} </math> and <math> \hat{U} </math> into <math> X' </math> and <math> U' </math>. Then, the loss on <math> X' </math>, <math> L_X </math> (Cross-entropy loss) and the loss on <math> U' </math>, <math> L_U </math> (Mean Squared Error) are calculated. A regularization term, <math> L_{reg} </math>, is introduced to regularize the model's average output across all samples in the mini-batch. Then, the total loss is calculated as:<br />
<br />
<center><math> L = L_X + \lambda_u L_U + \lambda_r L_{reg} </math></center> <br />
<br />
where <math> \lambda_r </math> is set to 1, and <math> \lambda_u </math> is used to control the unsupervised loss.<br />
<br />
Lastly, the stochastic gradient descent formula is updated with the calculated loss, <math> L </math>, and the estimated parameters, <math> \boldsymbol{ \theta } </math>.<br />
<br />
The full algorithm is shown below. [[File:dividemix.jpg|600px| | center]]<br />
<div align="center">Algorithm1: DivideMix. Line 4-8: co-divide; Line 17-18: label co-refinement; Line 20: co-guessing.</div><br />
<br />
Then, when the model is warmed up, it is trained on all data using standard cross-entropy to initially converge the model, but with a regulatory negative entropy term <math>\mathcal{H} = -\sum_{c}\text{p}^\text{c}_\text{model}(x;\theta)\log(\text{p}^\text{c}_\text{model}(x;\theta))</math>, where <math>\text{p}^\text{c}_\text{model}</math> is the softmax output probability for class c. This term penalizes confident predictions during the warm up to prevent overfitting to noise during the warm up, which can happen when there is asymmetric noise.<br />
<br />
== Results ==<br />
'''Applications'''<br />
<br />
The method was validated using four benchmark datasets: CIFAR-10, CIFAR100 (Krizhevsky & Hinton, 2009) which contain 50K training images and 10K test images of size 32 × 32), Clothing1M (Xiao et al., 2015), and WebVision (Li et al., 2017a).<br />
Two types of label noise are used in the experiments: symmetric and asymmetric.<br />
An 18-layer PreAct Resnet (He et al., 2016) is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 300 epochs. The initial learning rate was set to 0.02 and reduced by a factor of 10 after 150 epochs. Before applying the Co-divide and MixMatch strategies, the models were first independently trained over the entire dataset using cross-entropy loss during a "warm-up" period. Initially, training the models in this way prepares a more regular distribution of losses to improve upon in subsequent epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100. For all CIFAR experiments, we use the same hyperparameters M = 2, T = 0.5, and α = 4. τ is set as 0.5 except for 90% noise ratio when it is set as 0.6.<br />
<br />
<br />
'''Comparison of State-of-the-Art Methods'''<br />
<br />
The effectiveness of DivideMix was shown by comparing the test accuracy with the most recent state-of-the-art methods: <br />
Meta-Learning (Li et al., 2019) proposes a gradient-based method to find model parameters that are more noise-tolerant; <br />
Joint-Optim (Tanaka et al., 2018) and P-correction (Yi & Wu, 2019) jointly optimize the sample labels and the network parameters;<br />
M-correction (Arazo et al., 2019) models sample loss with BMM and apply MixUp.<br />
The following are the results on CIFAR-10 and CIFAR-100 with different levels of symmetric label noise ranging from 20% to 90%. Both the best test accuracy across all epochs and the averaged test accuracy over the last 10 epochs were recorded in the following table:<br />
<br />
<br />
[[File:divideMixtable1.PNG | center]]<br />
<br />
From table 1, the author noticed that none of these methods can consistently outperform others across different datasets. M-correction excels at symmetric noise, whereas Meta-Learning performs better for asymmetric noise. DivideMix outperforms state-of-the-art methods by a large margin across all noise ratios. The improvement is substantial (∼10% of accuracy) for the more challenging CIFAR-100 with high noise ratios.<br />
<br />
DivideMix was compared with the state-of-the-art methods with the other two datasets: Clothing1M and WebVision. It also shows that DivideMix consistently outperforms state-of-the-art methods across all datasets with different types of label noise. For WebVision, DivideMix achieves more than 12% improvement in top-1 accuracy. <br />
<br />
<br />
'''Ablation Study'''<br />
<br />
The effect of removing different components to provide insights into what makes DivideMix successful. We analyze the results in Table 5 as follows.<br />
<br />
<br />
[[File:DivideMixtable5.PNG | center]]<br />
<br />
The authors combined self-divide with the original MixMatch as a naive baseline for using SLL in LNL.<br />
They also find that both label refinement and input augmentation are beneficial for DivideMix. ''Label refinement'' is important for high noise ratio due because samples that are noisier would be incorrectly divided into the labeled set. ''Augmentation'' upgrades model performance by creating more reliable predictions and by achieving consistent regularization. In addition, the performance drop was seen in the ''DivideMix w/o co-training'' highlights the disadvantage of self-training; the model still has dataset division, label refinement and label guessing, but they are all performed by the same model.<br />
<br />
== Conclusion ==<br />
<br />
This paper provides a new and effective algorithm for learning with noisy labels by using highly noisy data unlabelled data in a Semi-Supervised Learning framework. The DivideMix method trains two networks simultaneously and utilizes co-guessing and co-labeling effectively, therefore it is a robust approach to deal with noise in datasets. Also, the DivideMix method has been tested using various datasets with the results consistently being one of the best when compared to the state-of-the-art methods through extensive experiments.<br />
<br />
Future work of DivideMix is to create an adaptation for other applications such as Natural Language Processing, and incorporating the ideas of SSL and LNL into DivideMix architecture.<br />
<br />
== Critiques/ Insights ==<br />
<br />
1. While combining both models makes the result better, the author did not show the relative time increase using this new combined methodology, which is very crucial considering training a large amount of data, especially for images. In addition, it seems that the author did not perform much on hyperparameters tuning for the combined model.<br />
<br />
2. There is an interesting insight, which is when the noise ratio increases from 80% to 90%, the accuracy of DivideMix drops dramatically in both datasets.<br />
<br />
3. There should be a further explanation of why the learning rate drops by a factor of 10 after 150 epochs.<br />
<br />
4. It would be interesting to see the effectiveness of this method in other domains such as NLP. I am not aware of noisy training datasets available in NLP, but surely this is an important area to focus on, as much of the available data is collected from noisy sources from the web.<br />
<br />
5. The paper implicitly assumes that a Gaussian mixture model (GMM) is sufficiently capable of identifying noise. Given the nature of a GMM, it would work well for noise that is distributed by a Gaussian distribution but for all other noise, it would probably be only asymptotic. The paper should present theoretical results on the noise that are Exponential, Rayleigh, etc. This is particularly important because the experiments were done on massive datasets, but they do not directly address the case when there are not many data points. <br />
<br />
6. Comparing the training result on these benchmark datasets makes the algorithm quite comprehensive. This is a very insightful idea to maintain two networks to avoid bias from occurring.<br />
<br />
7. The current benchmark accuracy for CIFAR-10 is 99.7, CIFAR-100 is 96.08 using EffNet-L2 in 2020. In 2019, CIFAR-10 is 99.37, CIFAR-100 is 93.51 using BiT-L.(based on paperswithcode.com) As there exists better methods, it would be nice to know why the authors chose these state-of-the-art methods to compare the test accuracy.<br />
<br />
8. Another interesting observation is that DivideMix seems to maintain a similar accuracy while some methods give unstable results. That shows the reliability of the proposed algorithm.<br />
<br />
9. It would be interesting to see if the drop in accuracy from increasing the noise ratio to 90% is a result of a low porportion or low number of clean labels. That is, would increasing the size of the training set but keeping the noise ratio at 90% result in increased accuracy?<br />
<br />
10. For Ablation Study part, the paper also introduced a study on the Robustness of Testing Marking Methods Noise, including AUC for classification of clean/noisy samples of CIFAR-10 training data. And it shows that the method can effectively separate clean and noisy samples as training proceeds.<br />
<br />
11. It is interesting how unlike common methods, the method in this paper discards the labels that are highly likely to be<br />
noisy. It also utilizes the noisy samples as unlabeled data to regularize training in a SSL manner. This model can better distinguish and utilize noisy samples.<br />
<br />
12. In the result section, the author gives us a comprehensive understanding of this algorithm by introducing the applications and the comparison of it with respect to similar methods. It would be attractive if in the application part, the author could indicate how the application relative to our daily life.<br />
<br />
13. High quality data is very important for training Machine learning systems. Preparing the data to train ML systems requires data annotations which are prone to errors and are time-consuming. It is interesting to note how paper 14 and this paper aims to approach this problem from different perspectives. Paper 14 introduces CSL algorithm that learns from confused or Noisy data to find the tasks associated with them. And this paper proposes an algorithm that shows good performance when learning from noisy data. Hence both the papers seem to tackle similar problem and implementing the approaches described in both the papers when handling noisy data can be twice helpful.<br />
<br />
14. Noise exists in all big data, and big data is what we are dealing with in real life nowadays. Having an effective noise eliminating method such as Dividemix is important to us.<br />
<br />
15. The DivideMix consistently outperforms state-of-the-art methods across the given datasets, but how about some other potential datasets? If it can be given that it has advantages for a certain type of potential dataset, it will be a better discussion.<br />
<br />
16. It would be better if there was more information regarding four benchmark datasets (CIFAR-10, CIFAR100, Clothing1M, and WebVision) so that readers could be aware of the properties or differences of those datasets.<br />
<br />
17. It would be interesting to compare the computational cost of this model versus others. Would it be more computationally faster to use in real world applications?<br />
<br />
== References ==<br />
[1] Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Unsupervised<br />
label noise modeling and loss correction. In ICML, pp. 312–321, 2019.<br />
<br />
[2] David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin<br />
Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019.<br />
<br />
[3] Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach<br />
to learning from noisy labels. In WACV, pp. 1215–1224, 2018.</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_Probability&diff=49704Being Bayesian about Categorical Probability2020-12-07T03:44:22Z<p>X277lu: /* Conclusion and Critiques */</p>
<hr />
<div>== Presented By ==<br />
Evan Li, Jason Pu, Karam Abuaisha, Nicholas Vadivelu<br />
<br />
== Introduction ==<br />
<br />
Since the outputs of neural networks are not probabilities, Softmax (Bridle, 1990) is a staple for neural network’s performing classification -- it exponentiates each logit then normalizes by the sum, giving a distribution over the target classes. Logit is a raw output/prediction of the model which is hard for humans to interpret, thus we transform/normalize these raw values into categories or meaningful numbers for interpretability. However, networks with softmax outputs give no information about uncertainty (Blundell et al., 2015; Gal & Ghahramani, 2016), and the resulting distribution over classes is poorly calibrated (Guo et al., 2017), often giving overconfident predictions even when the classification is wrong. In addition, softmax also raises concerns about overfitting NNs due to its confident predictive behaviors (Xie et al., 2016; Pereyra et al., 2017). To achieve performance with better generalization, some more effective regularization techniques might be required. <br />
<br />
Bayesian Neural Networks (BNNs; MacKay, 1992) can alleviate these issues, but the resulting posteriors over the parameters are often intractable. Approximations such as variational inference (Graves, 2011; Blundell et al., 2015) and Monte Carlo Dropout (Gal & Ghahramani, 2016) can still be expensive or give poor estimates for the posteriors. This work proposes a Bayesian treatment of the output logits of the neural network, treating the targets as a categorical random variable instead of a fixed label. This technique gives a computationally cheap way of being Bayesian to get well-calibrated uncertainty estimates on neural network classifications.<br />
<br />
== Related Work ==<br />
<br />
Using Bayesian Neural Networks is the dominant way of applying Bayesian techniques to neural networks. Many techniques have been developed to make posterior approximation more accurate and scalable, despite these, BNNs do not scale to the state of the art techniques or large data sets. There are techniques to explicitly avoid modeling the full weight posterior that is more scalable, such as with Monte Carlo Dropout (Gal & Ghahramani, 2016) or tracking mean/covariance of the posterior during training (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019). Non-Bayesian uncertainty estimation techniques such as deep ensembles (Lakshminarayanan et al., 2017) and temperature scaling (Guo et al., 2017; Neumann et al., 2018).<br />
<br />
== Preliminaries ==<br />
=== Definitions ===<br />
Let's formalize our classification problem and define some notations for the rest of this summary:<br />
<br />
::Dataset:<br />
$$ \mathcal D = \{(x_i,y_i)\} \in (\mathcal X \times \mathcal Y)^N $$<br />
::General classification model<br />
$$ f^W: \mathcal X \to \mathbb R^K $$<br />
::Softmax function: <br />
$$ \phi(x): \mathbb R^K \to [0,1]^K \;\;|\;\; \phi_k(X) = \frac{\exp(f_k^W(x))}{\sum_{k \in K} \exp(f_k^W(x))} $$<br />
::Softmax activated NN:<br />
$$ \phi \;\circ\; f^W: \chi \to \Delta^{K-1} $$<br />
::NN as a true classifier:<br />
$$ arg\max_i \;\circ\; \phi_i \;\circ\; f^W \;:\; \mathcal X \to \mathcal Y $$<br />
<br />
We'll also define the '''count function''' - a <math>K</math>-vector valued function that outputs the occurences of each class coincident with <math>x</math>:<br />
$$ c^{\mathcal D}(x) = \sum_{(x',y') \in \mathcal D} \mathbb y' I(x' = x) $$<br />
<br />
=== Classification With a Neural Network ===<br />
A typical loss function used in classification is cross-entropy, which is defined by<br />
<br />
$$ l_{\rm CE}(\tilde{y},\phi(f^{W}(x)))=-\sum_k \tilde{y_k} \log \phi_k(f^{W}(x))) $$<br />
<br />
,here <math>y_k</math> and <math>\phi_k</math> refers to the actual and predicted categorical distribution for each class. It's well known that optimizing <math>f^W</math> for <math>l_{CE}</math> is equivalent to optimizing for <math>l_{KL}</math>, the <math>KL</math> divergence between the true distribution and the distribution modeled by NN, that is:<br />
$$ l_{KL}(W) = KL(\text{true distribution} \;|\; \text{distribution encoded by }NN(W)) $$<br />
Let's introduce notations for the underlying (true) distributions of our problem. Let <math>(x_0,y_0) \sim (\mathcal X \times \mathcal Y)</math>:<br />
$$ \text{Full Distribution} = F(x,y) = P(x_0 = x,y_0 = y) $$<br />
$$ \text{Marginal Distribution} = P(x) = F(x_0 = x) $$<br />
$$ \text{Point Class Distribution} = P(y_0 = y \;|\; x_0 = x) = F_x(y) $$<br />
Then we have the following factorization:<br />
$$ F(x,y) = P(x,y) = P(y|x)P(x) = F_x(y)F(x) $$<br />
Substitute this into the definition of KL divergence:<br />
$$ = \sum_{(x,y) \in \mathcal X \times \mathcal Y} F(x,y) \log\left(\frac{F(x,y)}{\phi_y(f^W(x))}\right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F(y|x) \log\left( \frac{F(y|x)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F_x(y) \log\left( \frac{F_x(y)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) KL(F_x \;||\; \phi\left( f^W(x) \right)) $$<br />
As usual, we don't have an analytic form for <math>l</math> (if we did, this would imply we know <math>F_X</math> meaning we knew the distribution in the first place). Instead, estimate from <math>\mathcal D</math>:<br />
$$ F(x) \approx \hat F(x) = \frac{||c^{\mathcal D}(x)||_1}{N} $$<br />
$$ F_x(y) \approx \hat F_x(y) = \frac{c^{\mathcal D}(x)}{|| c^{\mathcal D}(x) ||_1}$$<br />
$$ \to l_{KL}(W) = \sum_{x \in \mathcal D} \frac{||c^{\mathcal D}(x)||_1}{N} KL \left( \frac{c^{\mathcal D}(x)}{||c^{\mathcal D}(x)||_1} \;||\; \phi(f^W(x)) \right) $$<br />
The approximations <math>\hat F, \hat F_X</math> are often not very good though: consider a typical classification such as MNIST, we would never expect two handwritten digits to produce the exact same image. Hence <math>c^{\mathcal D}(x)</math> is (almost) always going to have a single index 1 and the rest 0. This has implications for our approximations:<br />
$$ \hat F(x) \text{ is uniform for all } x \in \mathcal D $$<br />
$$ \hat F_x(y) \text{ is degenerate for all } x \in \mathcal D $$<br />
This clearly has implications for overfitting: to minimize the KL term in <math>l_{KL}(W)</math> we want <math>\phi(f^W(x))</math> to be very close to <math>\hat F_x(y)</math> at each point - this means that the loss function is in fact encouraging the neural network to output near degenerate distributions! <br />
<br />
'''Label Smoothing'''<br />
<br />
One form of regularization to help this problem is called label smoothing. Instead of using the degenerate $$F_x(y)$$ as a target function, let's "smooth" it (by adding a scaled uniform distribution to it) so it's no longer degenerate:<br />
$$ F'_x(y) = (1-\lambda)\hat F_x(y) + \frac \lambda K \vec 1 $$<br />
<br />
'''BNNs'''<br />
<br />
BBNs balances the complexity of the model and the distance to target distribution without choosing a single beset configuration (one-hot encoding). Specifically, BNNs with the Gaussian Weight prior $$F_x(y) = N (0,T^{-1} I)$$ has score of configuration <math>W</math> measured by the posterior density $$p_W(W|D) = p(D|W)p_W(W), \log(p_W(W)) = T||W||^2_2$$<br />
Here <math>||W||^2_2</math> could be a poor proxy to penalized for the model complexity due to its linear nature.<br />
<br />
== Method ==<br />
The main technical proposal of the paper is a Bayesian framework to estimate the (former) target distribution <math>F_x(y)</math>. That is, we construct a posterior distribution of <math> F_x(y) </math> and use that as our new target distribution. We call it the ''belief matching'' (BM) framework.<br />
<br />
=== Constructing Target Distribution ===<br />
Recall that <math>F_x(y)</math> is a k-categorical probability distribution - its PMF can be fully characterized by k numbers that sum to 1. Hence we can encode any such <math>F_x</math> as a point in <math>\Delta^{k-1}</math>. We'll do exactly that - let's call this vector <math>z</math>:<br />
$$ z \in \Delta^{k-1} $$<br />
$$ \text{prior} = p_{z|x}(z) $$<br />
$$ \text{conditional} = p_{y|z,x}(y) $$<br />
$$ \text{posterior} = p_{z|x,y}(z) $$<br />
Then if we perform inference:<br />
$$ p_{z|x,y}(z) \propto p_{z|x}(z)p_{y|z,x}(y) $$<br />
The distribution chosen to model prior was <math>dir_K(\beta)</math>:<br />
$$ p_{z|x}(z) = \frac{\Gamma(||\beta||_1)}{\prod_{k=1}^K \Gamma(\beta_k)} \prod_{k=1}^K z_k^{\beta_k - 1} $$<br />
Note that by definition of <math>z</math>: <math> p_{y|x,z} = z_y </math>. Since the Dirichlet is a conjugate prior to categorical distributions we have a convenient form for the mean of the posterior:<br />
$$ \bar{p_{z|x,y}}(z) = \frac{\beta + c^{\mathcal D}(x)}{||\beta + c^{\mathcal D}(x)||_1} \propto \beta + c^{\mathcal D}(x) $$<br />
This is in fact a generalization of (uniform) label smoothing (label smoothing is a special case where <math>\beta = \frac 1 K \vec{1} </math>).<br />
<br />
=== Representing Approximate Distribution ===<br />
Our new target distribution is <math>p_{z|x,y}(z)</math> (as opposed to <math>F_x(y)</math>). That is, we want to construct an interpretation of our neural network weights to construct a distribution with support in <math> \Delta^{K-1} </math> - the NN can then be trained so this encoded distribution closely approximates <math>p_{z|x,y}</math>. Let's denote the PMF of this encoded distribution <math>q_{z|x}^W</math>. This is how the BM framework defines it:<br />
$$ \alpha^W(x) := \exp(f^W(x)) $$<br />
$$ q_{z|x}^W(z) = \frac{\Gamma(||\alpha^W(x)||_1)}{\sum_{k=1}^K \Gamma(\alpha_k^W(x))} \prod_{k=1}^K z_{k}^{\alpha_k^W(x) - 1} $$<br />
$$ \to Z^W_x \sim dir(\alpha^W(x)) $$<br />
Apply <math>\log</math> then <math>\exp</math> to <math>q_{z|x}^W</math>:<br />
$$ q^W_{z|x}(z) \propto \exp \left( \sum_k (\alpha_k^W(x) \log(z_k)) - \sum_k \log(z_k) \right) $$<br />
$$ \propto -l_{CE}(\phi(f^W(x)),z) + \frac{K}{||\alpha^W(x)||}KL(\mathcal U_k \;||\; z) $$<br />
It can actually be shown that the mean of <math>Z_x^W</math> is identical to <math>\phi(f^W(x))</math> - in other words, if we output the mean of the encoded distribution of our neural network under the BM framework, it is theoretically identical to a traditional neural network.<br />
<br />
=== Distribution Matching ===<br />
<br />
We now need a way to fit our approximate distribution from our neural network <math>q_{\mathbf{z | x}}^{\mathbf{W}}</math> to our target distribution <math>p_{\mathbf{z|x},y}</math>. The authors achieve this by maximizing the evidence lower bound (ELBO):<br />
<br />
$$l_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) $$<br />
<br />
Each term can be computed analytically:<br />
<br />
$$\mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf W }} \left[\log z_y \right] = \psi(\alpha_y^{\mathbf W} ( \mathbf x )) - \psi(\alpha_0^{\mathbf W} ( \mathbf x )) $$<br />
<br />
Where <math>\psi(\cdot)</math> represents the digamma function (logarithmic derivative of gamma function). Intuitively, we maximize the probability of the correct label. For the KL term:<br />
<br />
$$KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) = \log \frac{\Gamma(a_0^{\mathbf W}(\mathbf x)) \prod_k \Gamma(\beta_k)}{\prod_k \Gamma(\alpha_k^{\mathbf W}(x)) \Gamma (\beta_0)} + \sum_k (\alpha_k^{\mathbf W}(x)-\beta_k)(\psi(\alpha_k^{\mathbf W}(\mathbf x)) - \psi(\alpha_0^{\mathbf W}(\mathbf x)) $$<br />
<br />
In the first term, for intuition, we can ignore <math>\alpha_0</math> and <math>\beta_0</math> since those just calibrate the distributions. Otherwise, we want the ratio of the products to be as close to 1 as possible to minimize the KL. In the second term, we want to minimize the difference between each individual <math>\alpha_k</math> and <math>\beta_k</math>, scaled by the normalized output of the neural network. <br />
<br />
This loss function can be used as a drop-in replacement for the standard softmax cross-entropy, as it has an analytic form and the same time complexity as typical softmax-cross entropy with respect to the number of classes (<math>O(K)</math>).<br />
<br />
=== On Prior Distributions ===<br />
<br />
We must choose our concentration parameter, <math>\beta</math>, for our dirichlet prior. We see our prior essentially disappears as <math>\beta_0 \to 0</math> and becomes stronger as <math>\beta_0 \to \infty</math>. Thus, we want a small <math>\beta_0</math> so the posterior isn't dominated by the prior. But, the authors claim that a small <math>\beta_0</math> makes <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small, which causes <math>\psi (\alpha_0^{\mathbf W}(\mathbf x))</math> to be large, which is problematic for gradient based optimization. In practice, many neural network techniques aim to make <math>\mathbb E [f^{\mathbf W} (\mathbf x)] \approx \mathbf 0</math> and thus <math>\mathbb E [\alpha^{\mathbf W} (\mathbf x)] \approx \mathbf 1</math>, which means making <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small can be counterproductive.<br />
<br />
So, the authors set <math>\beta = \mathbf 1</math> and introduce a new hyperparameter <math>\lambda</math> which is multiplied with the KL term in the ELBO:<br />
<br />
$$l^\lambda_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - \lambda KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; \mathcal P^D (\mathbf 1)) $$<br />
<br />
This stabilizes the optimization, as we can tell from the gradients:<br />
<br />
$$\frac{\partial l_{E B}\left(\mathbf{y}, \alpha^{\mathbf W}(\mathbf{x})\right)}{\partial \alpha_{k}^{\mathbf W}(\mathbf {x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\alpha_{k}^{\mathbf W}(\mathbf{x})-\beta_{k}\right)\right) \psi^{\prime}\left(\alpha_{k}^{\mathbf{W}}(\boldsymbol{x})\right)<br />
-\left(1-\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})-\beta_{0}\right)\right) \psi^{\prime}\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})\right)$$<br />
<br />
$$\frac{\partial l_{E B}^{\lambda}\left(\mathbf{y}, \alpha^{\mathbf{W}}(\mathbf{x})\right)}{\partial \alpha_{k}^{W}(\mathbf{x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})-\lambda\right)\right) \frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}<br />
-\left(1-\left(\tilde{\alpha}_{0}^{W}(\mathbf{x})-\lambda K\right)\right)$$<br />
<br />
As we can see, the first expression is affected by the magnitude of <math>\alpha^{\boldsymbol{W}}(\boldsymbol{x})</math>, whereas the second expression is not due to the <math>\frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}</math> ratio.<br />
<br />
== Experiments ==<br />
<br />
Throughout the experiments in this paper, the authors employ various models based on residual connections (He et al., 2016 [1]) which are the models used for benchmarking in practice. We will first demonstrate improvements provided by BM, then we will show versatility in other applications. For fairness of comparisons, all configurations in the reference implementation will be fixed. The only additions in the experiments are initial learning rate warm-up and gradient clipping which are extremely helpful for stable training of BM. <br />
<br />
=== Generalization performance === <br />
The paper compares the generalization performance of BM with softmax and MC dropout on CIFAR-10 and CIFAR-100 benchmarks.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T1.png]]<br />
<br />
The next comparison was performed between BM and softmax on the ImageNet benchmark. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T2.png]]<br />
<br />
For both datasets and In all configurations, BM achieves the best generalization and outperforms softmax and MC dropout.<br />
<br />
===== Regularization effect of prior =====<br />
<br />
In theory, BM has 2 regularization effects:<br />
The prior distribution, which smooths the target posterior<br />
Averaging all of the possible categorical probabilities to compute the distribution matching loss<br />
The authors perform an ablation study to examine the 2 effects separately - removing the KL term in the ELBO removes the effect of the prior distribution.<br />
For ResNet-50 on CIFAR-100 and CIFAR-10 the resulting test error rates were 24.69% and 5.68% respectively. <br />
<br />
This demonstrates that both regularization effects are significant since just having one of them improves the generalization performance compared to the softmax baseline, and having both improves the performance even more.<br />
<br />
===== Impact of <math>\beta</math> =====<br />
<br />
The effect of β on generalization performance is studied by training ResNet-18 on CIFAR-10 by tuning the value of β on its own, as well as jointly with λ. It was found that robust generalization performance is obtained for β ∈ [<math>e^{−1}, e^4</math>] when tuning β on its own; and β ∈ [<math>e^{−4}, e^{8}</math>] when tuning β jointly with λ. The figure below shows a plot of the error rate with varying β.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F3.png]]<br />
<br />
=== Uncertainty Representation ===<br />
<br />
One of the big advantages of BM is the ability to represent uncertainty about the prediction. The authors evaluate the uncertainty representation on in-distribution (ID) and out-of-distribution (OOD) samples. <br />
<br />
===== ID uncertainty =====<br />
<br />
For ID (in-distribution) samples, calibration performance is measured, which is a measure of how well the model’s confidence matches its actual accuracy. This measure can be visualized using reliability plots and quantified using a metric called expected calibration error (ECE). ECE is calculated by grouping predictions into M groups based on their confidence score and then finding the absolute difference between the average accuracy and average confidence for each group. We can define the ECE of <math>f^W </math> on <math>D </math> with <math>M</math> groups as <br />
<br />
<center><br />
<math>ECE_M(f^W, D) = \sum^M_{i=1} \frac{|G_i|}{|D|}|acc(G_i) - conf(G_i)|</math><br />
</center><br />
Where <math>G_i</math> is a set of samples int the i-th group defined as <math>G_i = \{j:i/M < max_k\phi_k(f^Wx^{(j)}) \leq (1+i)/M\}</math>, <math>acc(G_i)</math> is an average accuracy in the i-th group and <math>conf(G_i)</math> is an average confidence in the i-th group.<br />
<br />
The figure below is a reliability plot of ResNet-50 on CIFAR-10 and CIFAR-100 with 15 groups. It shows that BM has a significantly better calibration performance than softmax since the confidence matches the accuracy more closely (this is also reflected in the lower ECE).<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F4.png]]<br />
<br />
===== OOD uncertainty =====<br />
<br />
Here, the authors quantify uncertainty using predictive entropy - the larger the predictive entropy, the larger the uncertainty about a prediction. <br />
<br />
The figure below is a density plot of the predictive entropy of ResNet-50 on CIFAR-10. It shows that BM provides significantly better uncertainty estimation compared to other methods since BM is the only method that has a clear peak of high predictive entropy for OOD samples which should have high uncertainty. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F5.png]]<br />
<br />
=== Transfer learning ===<br />
<br />
Belief matching applies the Bayesian principle outside the neural network, which means it can easily be applied to already trained models. Thus, belief matching can be employed in transfer learning scenarios. The authors downloaded the ImageNet pre-trained ResNet-50 weights and fine-tuned the weights of the last linear layer for 100 epochs using an Adam optimizer.<br />
<br />
This table shows the test error rates from transfer learning on CIFAR-10, Food-101, and Cars datasets. Belief matching consistently performs better than softmax. <br />
<br />
[[File:being_bayesian_about_categorical_probability_transfer_learning.png]]<br />
<br />
Belief matching was also tested for the predictive uncertainty for out of dataset samples based on CIFAR-10 as the in distribution sample. Looking at the figure below, it is observed that belief matching significantly improves the uncertainty representation of pre-trained models by only fine-tuning the last layer’s weights. Note that belief matching confidently predicts examples in Cars since CIFAR-10 contains the object category automobiles. In comparison, softmax produces confident predictions on all datasets. Thus, belief matching could also be used to enhance the uncertainty representation ability of pre-trained models without sacrificing their generalization performance.<br />
<br />
[[File: being_bayesian_about_categorical_probability_transfer_learning_uncertainty.png]]<br />
<br />
=== Semi-Supervised Learning ===<br />
<br />
Belief matching’s ability to allow neural networks to represent rich information in their predictions can be exploited to aid consistency based loss function for semi-supervised learning. Consistency-based loss functions use unlabelled samples to determine where to promote the robustness of predictions based on stochastic perturbations. This can be done by perturbing the inputs (which is the VAT model) or the networks (which is the pi-model). Both methods minimize the divergence between two categorical probabilities under some perturbations, thus belief matching can be used by the following replacements in the loss functions. The hope is that belief matching can provide better prediction consistencies using its Dirichlet distributions.<br />
<br />
[[File: being_bayesian_about_categorical_probability_semi_supervised_equation.png]]<br />
<br />
The results of training on ResNet28-2 with consistency based loss functions on CIFAR-10 are shown in this table. Belief matching does have lower classification error rates compared to using a softmax.<br />
<br />
[[File:being_bayesian_about_categorical_probability_semi_supervised_table.png]]<br />
<br />
== Conclusion and Critiques ==<br />
<br />
* Bayesian principles can be used to construct the target distribution by using the categorical probability as a random variable rather than a training label. This can be applied to neural network models by replacing only the softmax and cross-entropy loss while improving the generalization performance, uncertainty estimation and well-calibrated behavior. <br />
<br />
* In the future, the authors would like to allow for more expressive distributions in the belief matching framework, such as logistic normal distributions to capture strong semantic similarities among class labels. Furthermore, using input dependent priors would allow for interesting properties that would aid imbalanced datasets and multi-domain learning.<br />
<br />
* Overall I think this summary is very good. The Method(Algorithm) section is described clearly, and the Results section is detailed, with many diagrams illustrating the main points. I just have one technical suggestion: the difference in performance for SOFTMAX and BM differs by model. For example, for RESNEXT-50 model, the difference in top1 is 0.2, whereas, for the RESNEXT-100 model, the difference in the top one is 0.5, which is significantly higher. It's true that BM method generally outperforms SOFTMAX. But seeing the relation between the choice of model and the magnitude of a performance increase could definitely strengthen the paper even further.<br />
<br />
* The summary is good and the topic is interesting. Bayesian is a well know probabilistic model but did not know that it can be used as a neural network. Comparison between softmax and bayesian was interesting and more details would be great.<br />
<br />
* It would be better if there is a future work section to discuss the current shortage and potential improvement. One thing would be that the theoretical part is complex in the process. In addition, optimizing a function is relatively hard if the structure is complex. Is it possible to have a good approximation without having too complex a calculation?<br />
<br />
* Both experiments dealt with image data, however, softmax is used within classification neural networks that range from image to textual data. It would be interesting to see the performance of BM on textual data for text classification problems in addition to image classification.<br />
<br />
* It would be better to briefly explain Bayesian treatment in the introduction part(i.e., considering the categorical probability as a random variable, construct the target distribution by means of the Bayesian inference), and to analyze the importance of considering the categorical probability as a random variable (for example explain it can be adapted to existing deep learning building blocks without huge modifications).<br />
<br />
* Interesting topic that goes close to our lectures. Since this is a summary of the paper, it would be better if trim the explanation on Neural Network a little like getting rid of the substitution lines.<br />
<br />
* I really liked the presentation and actually really appreciate the steps of the detailed derivation that were presented in this summary. In the introduction the researchers mentioned that BM is a computationally cheap method, however, I was wondering how much faster it is computationally as opposed to the other models to train. Additionally, the training data that was used to benchmark the classification performance seemed to all be image classifications (CIFAR-10, CIFAR-100, ResNet-50, ResNet-101), thus it would have been nice to see classification be applied in other multi-class contexts as well to see how well this new method performs there.<br />
<br />
* It would be more clear if the metric used in evaluating the models is briefly explained!<br />
<br />
== Citations ==<br />
<br />
[1] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Springer, 1990.<br />
<br />
[2] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.<br />
<br />
[3] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.<br />
<br />
[4] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. <br />
<br />
[5] MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448– 472, 1992.<br />
<br />
[6] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011. <br />
<br />
[7] Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1):4873–4907, 2017.<br />
<br />
[8] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference of Machine Learning, 2018.<br />
<br />
[9] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[10] Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[11] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.<br />
<br />
[12] Neumann, L., Zisserman, A., and Vedaldi, A. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2018.<br />
<br />
[13] Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. Disturblabel: Regularizing cnn on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.<br />
<br />
[14] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_Probability&diff=49702Being Bayesian about Categorical Probability2020-12-07T03:41:29Z<p>X277lu: /* Introduction */</p>
<hr />
<div>== Presented By ==<br />
Evan Li, Jason Pu, Karam Abuaisha, Nicholas Vadivelu<br />
<br />
== Introduction ==<br />
<br />
Since the outputs of neural networks are not probabilities, Softmax (Bridle, 1990) is a staple for neural network’s performing classification -- it exponentiates each logit then normalizes by the sum, giving a distribution over the target classes. Logit is a raw output/prediction of the model which is hard for humans to interpret, thus we transform/normalize these raw values into categories or meaningful numbers for interpretability. However, networks with softmax outputs give no information about uncertainty (Blundell et al., 2015; Gal & Ghahramani, 2016), and the resulting distribution over classes is poorly calibrated (Guo et al., 2017), often giving overconfident predictions even when the classification is wrong. In addition, softmax also raises concerns about overfitting NNs due to its confident predictive behaviors (Xie et al., 2016; Pereyra et al., 2017). To achieve performance with better generalization, some more effective regularization techniques might be required. <br />
<br />
Bayesian Neural Networks (BNNs; MacKay, 1992) can alleviate these issues, but the resulting posteriors over the parameters are often intractable. Approximations such as variational inference (Graves, 2011; Blundell et al., 2015) and Monte Carlo Dropout (Gal & Ghahramani, 2016) can still be expensive or give poor estimates for the posteriors. This work proposes a Bayesian treatment of the output logits of the neural network, treating the targets as a categorical random variable instead of a fixed label. This technique gives a computationally cheap way of being Bayesian to get well-calibrated uncertainty estimates on neural network classifications.<br />
<br />
== Related Work ==<br />
<br />
Using Bayesian Neural Networks is the dominant way of applying Bayesian techniques to neural networks. Many techniques have been developed to make posterior approximation more accurate and scalable, despite these, BNNs do not scale to the state of the art techniques or large data sets. There are techniques to explicitly avoid modeling the full weight posterior that is more scalable, such as with Monte Carlo Dropout (Gal & Ghahramani, 2016) or tracking mean/covariance of the posterior during training (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019). Non-Bayesian uncertainty estimation techniques such as deep ensembles (Lakshminarayanan et al., 2017) and temperature scaling (Guo et al., 2017; Neumann et al., 2018).<br />
<br />
== Preliminaries ==<br />
=== Definitions ===<br />
Let's formalize our classification problem and define some notations for the rest of this summary:<br />
<br />
::Dataset:<br />
$$ \mathcal D = \{(x_i,y_i)\} \in (\mathcal X \times \mathcal Y)^N $$<br />
::General classification model<br />
$$ f^W: \mathcal X \to \mathbb R^K $$<br />
::Softmax function: <br />
$$ \phi(x): \mathbb R^K \to [0,1]^K \;\;|\;\; \phi_k(X) = \frac{\exp(f_k^W(x))}{\sum_{k \in K} \exp(f_k^W(x))} $$<br />
::Softmax activated NN:<br />
$$ \phi \;\circ\; f^W: \chi \to \Delta^{K-1} $$<br />
::NN as a true classifier:<br />
$$ arg\max_i \;\circ\; \phi_i \;\circ\; f^W \;:\; \mathcal X \to \mathcal Y $$<br />
<br />
We'll also define the '''count function''' - a <math>K</math>-vector valued function that outputs the occurences of each class coincident with <math>x</math>:<br />
$$ c^{\mathcal D}(x) = \sum_{(x',y') \in \mathcal D} \mathbb y' I(x' = x) $$<br />
<br />
=== Classification With a Neural Network ===<br />
A typical loss function used in classification is cross-entropy, which is defined by<br />
<br />
$$ l_{\rm CE}(\tilde{y},\phi(f^{W}(x)))=-\sum_k \tilde{y_k} \log \phi_k(f^{W}(x))) $$<br />
<br />
,here <math>y_k</math> and <math>\phi_k</math> refers to the actual and predicted categorical distribution for each class. It's well known that optimizing <math>f^W</math> for <math>l_{CE}</math> is equivalent to optimizing for <math>l_{KL}</math>, the <math>KL</math> divergence between the true distribution and the distribution modeled by NN, that is:<br />
$$ l_{KL}(W) = KL(\text{true distribution} \;|\; \text{distribution encoded by }NN(W)) $$<br />
Let's introduce notations for the underlying (true) distributions of our problem. Let <math>(x_0,y_0) \sim (\mathcal X \times \mathcal Y)</math>:<br />
$$ \text{Full Distribution} = F(x,y) = P(x_0 = x,y_0 = y) $$<br />
$$ \text{Marginal Distribution} = P(x) = F(x_0 = x) $$<br />
$$ \text{Point Class Distribution} = P(y_0 = y \;|\; x_0 = x) = F_x(y) $$<br />
Then we have the following factorization:<br />
$$ F(x,y) = P(x,y) = P(y|x)P(x) = F_x(y)F(x) $$<br />
Substitute this into the definition of KL divergence:<br />
$$ = \sum_{(x,y) \in \mathcal X \times \mathcal Y} F(x,y) \log\left(\frac{F(x,y)}{\phi_y(f^W(x))}\right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F(y|x) \log\left( \frac{F(y|x)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F_x(y) \log\left( \frac{F_x(y)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) KL(F_x \;||\; \phi\left( f^W(x) \right)) $$<br />
As usual, we don't have an analytic form for <math>l</math> (if we did, this would imply we know <math>F_X</math> meaning we knew the distribution in the first place). Instead, estimate from <math>\mathcal D</math>:<br />
$$ F(x) \approx \hat F(x) = \frac{||c^{\mathcal D}(x)||_1}{N} $$<br />
$$ F_x(y) \approx \hat F_x(y) = \frac{c^{\mathcal D}(x)}{|| c^{\mathcal D}(x) ||_1}$$<br />
$$ \to l_{KL}(W) = \sum_{x \in \mathcal D} \frac{||c^{\mathcal D}(x)||_1}{N} KL \left( \frac{c^{\mathcal D}(x)}{||c^{\mathcal D}(x)||_1} \;||\; \phi(f^W(x)) \right) $$<br />
The approximations <math>\hat F, \hat F_X</math> are often not very good though: consider a typical classification such as MNIST, we would never expect two handwritten digits to produce the exact same image. Hence <math>c^{\mathcal D}(x)</math> is (almost) always going to have a single index 1 and the rest 0. This has implications for our approximations:<br />
$$ \hat F(x) \text{ is uniform for all } x \in \mathcal D $$<br />
$$ \hat F_x(y) \text{ is degenerate for all } x \in \mathcal D $$<br />
This clearly has implications for overfitting: to minimize the KL term in <math>l_{KL}(W)</math> we want <math>\phi(f^W(x))</math> to be very close to <math>\hat F_x(y)</math> at each point - this means that the loss function is in fact encouraging the neural network to output near degenerate distributions! <br />
<br />
'''Label Smoothing'''<br />
<br />
One form of regularization to help this problem is called label smoothing. Instead of using the degenerate $$F_x(y)$$ as a target function, let's "smooth" it (by adding a scaled uniform distribution to it) so it's no longer degenerate:<br />
$$ F'_x(y) = (1-\lambda)\hat F_x(y) + \frac \lambda K \vec 1 $$<br />
<br />
'''BNNs'''<br />
<br />
BBNs balances the complexity of the model and the distance to target distribution without choosing a single beset configuration (one-hot encoding). Specifically, BNNs with the Gaussian Weight prior $$F_x(y) = N (0,T^{-1} I)$$ has score of configuration <math>W</math> measured by the posterior density $$p_W(W|D) = p(D|W)p_W(W), \log(p_W(W)) = T||W||^2_2$$<br />
Here <math>||W||^2_2</math> could be a poor proxy to penalized for the model complexity due to its linear nature.<br />
<br />
== Method ==<br />
The main technical proposal of the paper is a Bayesian framework to estimate the (former) target distribution <math>F_x(y)</math>. That is, we construct a posterior distribution of <math> F_x(y) </math> and use that as our new target distribution. We call it the ''belief matching'' (BM) framework.<br />
<br />
=== Constructing Target Distribution ===<br />
Recall that <math>F_x(y)</math> is a k-categorical probability distribution - its PMF can be fully characterized by k numbers that sum to 1. Hence we can encode any such <math>F_x</math> as a point in <math>\Delta^{k-1}</math>. We'll do exactly that - let's call this vector <math>z</math>:<br />
$$ z \in \Delta^{k-1} $$<br />
$$ \text{prior} = p_{z|x}(z) $$<br />
$$ \text{conditional} = p_{y|z,x}(y) $$<br />
$$ \text{posterior} = p_{z|x,y}(z) $$<br />
Then if we perform inference:<br />
$$ p_{z|x,y}(z) \propto p_{z|x}(z)p_{y|z,x}(y) $$<br />
The distribution chosen to model prior was <math>dir_K(\beta)</math>:<br />
$$ p_{z|x}(z) = \frac{\Gamma(||\beta||_1)}{\prod_{k=1}^K \Gamma(\beta_k)} \prod_{k=1}^K z_k^{\beta_k - 1} $$<br />
Note that by definition of <math>z</math>: <math> p_{y|x,z} = z_y </math>. Since the Dirichlet is a conjugate prior to categorical distributions we have a convenient form for the mean of the posterior:<br />
$$ \bar{p_{z|x,y}}(z) = \frac{\beta + c^{\mathcal D}(x)}{||\beta + c^{\mathcal D}(x)||_1} \propto \beta + c^{\mathcal D}(x) $$<br />
This is in fact a generalization of (uniform) label smoothing (label smoothing is a special case where <math>\beta = \frac 1 K \vec{1} </math>).<br />
<br />
=== Representing Approximate Distribution ===<br />
Our new target distribution is <math>p_{z|x,y}(z)</math> (as opposed to <math>F_x(y)</math>). That is, we want to construct an interpretation of our neural network weights to construct a distribution with support in <math> \Delta^{K-1} </math> - the NN can then be trained so this encoded distribution closely approximates <math>p_{z|x,y}</math>. Let's denote the PMF of this encoded distribution <math>q_{z|x}^W</math>. This is how the BM framework defines it:<br />
$$ \alpha^W(x) := \exp(f^W(x)) $$<br />
$$ q_{z|x}^W(z) = \frac{\Gamma(||\alpha^W(x)||_1)}{\sum_{k=1}^K \Gamma(\alpha_k^W(x))} \prod_{k=1}^K z_{k}^{\alpha_k^W(x) - 1} $$<br />
$$ \to Z^W_x \sim dir(\alpha^W(x)) $$<br />
Apply <math>\log</math> then <math>\exp</math> to <math>q_{z|x}^W</math>:<br />
$$ q^W_{z|x}(z) \propto \exp \left( \sum_k (\alpha_k^W(x) \log(z_k)) - \sum_k \log(z_k) \right) $$<br />
$$ \propto -l_{CE}(\phi(f^W(x)),z) + \frac{K}{||\alpha^W(x)||}KL(\mathcal U_k \;||\; z) $$<br />
It can actually be shown that the mean of <math>Z_x^W</math> is identical to <math>\phi(f^W(x))</math> - in other words, if we output the mean of the encoded distribution of our neural network under the BM framework, it is theoretically identical to a traditional neural network.<br />
<br />
=== Distribution Matching ===<br />
<br />
We now need a way to fit our approximate distribution from our neural network <math>q_{\mathbf{z | x}}^{\mathbf{W}}</math> to our target distribution <math>p_{\mathbf{z|x},y}</math>. The authors achieve this by maximizing the evidence lower bound (ELBO):<br />
<br />
$$l_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) $$<br />
<br />
Each term can be computed analytically:<br />
<br />
$$\mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf W }} \left[\log z_y \right] = \psi(\alpha_y^{\mathbf W} ( \mathbf x )) - \psi(\alpha_0^{\mathbf W} ( \mathbf x )) $$<br />
<br />
Where <math>\psi(\cdot)</math> represents the digamma function (logarithmic derivative of gamma function). Intuitively, we maximize the probability of the correct label. For the KL term:<br />
<br />
$$KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) = \log \frac{\Gamma(a_0^{\mathbf W}(\mathbf x)) \prod_k \Gamma(\beta_k)}{\prod_k \Gamma(\alpha_k^{\mathbf W}(x)) \Gamma (\beta_0)} + \sum_k (\alpha_k^{\mathbf W}(x)-\beta_k)(\psi(\alpha_k^{\mathbf W}(\mathbf x)) - \psi(\alpha_0^{\mathbf W}(\mathbf x)) $$<br />
<br />
In the first term, for intuition, we can ignore <math>\alpha_0</math> and <math>\beta_0</math> since those just calibrate the distributions. Otherwise, we want the ratio of the products to be as close to 1 as possible to minimize the KL. In the second term, we want to minimize the difference between each individual <math>\alpha_k</math> and <math>\beta_k</math>, scaled by the normalized output of the neural network. <br />
<br />
This loss function can be used as a drop-in replacement for the standard softmax cross-entropy, as it has an analytic form and the same time complexity as typical softmax-cross entropy with respect to the number of classes (<math>O(K)</math>).<br />
<br />
=== On Prior Distributions ===<br />
<br />
We must choose our concentration parameter, <math>\beta</math>, for our dirichlet prior. We see our prior essentially disappears as <math>\beta_0 \to 0</math> and becomes stronger as <math>\beta_0 \to \infty</math>. Thus, we want a small <math>\beta_0</math> so the posterior isn't dominated by the prior. But, the authors claim that a small <math>\beta_0</math> makes <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small, which causes <math>\psi (\alpha_0^{\mathbf W}(\mathbf x))</math> to be large, which is problematic for gradient based optimization. In practice, many neural network techniques aim to make <math>\mathbb E [f^{\mathbf W} (\mathbf x)] \approx \mathbf 0</math> and thus <math>\mathbb E [\alpha^{\mathbf W} (\mathbf x)] \approx \mathbf 1</math>, which means making <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small can be counterproductive.<br />
<br />
So, the authors set <math>\beta = \mathbf 1</math> and introduce a new hyperparameter <math>\lambda</math> which is multiplied with the KL term in the ELBO:<br />
<br />
$$l^\lambda_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - \lambda KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; \mathcal P^D (\mathbf 1)) $$<br />
<br />
This stabilizes the optimization, as we can tell from the gradients:<br />
<br />
$$\frac{\partial l_{E B}\left(\mathbf{y}, \alpha^{\mathbf W}(\mathbf{x})\right)}{\partial \alpha_{k}^{\mathbf W}(\mathbf {x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\alpha_{k}^{\mathbf W}(\mathbf{x})-\beta_{k}\right)\right) \psi^{\prime}\left(\alpha_{k}^{\mathbf{W}}(\boldsymbol{x})\right)<br />
-\left(1-\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})-\beta_{0}\right)\right) \psi^{\prime}\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})\right)$$<br />
<br />
$$\frac{\partial l_{E B}^{\lambda}\left(\mathbf{y}, \alpha^{\mathbf{W}}(\mathbf{x})\right)}{\partial \alpha_{k}^{W}(\mathbf{x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})-\lambda\right)\right) \frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}<br />
-\left(1-\left(\tilde{\alpha}_{0}^{W}(\mathbf{x})-\lambda K\right)\right)$$<br />
<br />
As we can see, the first expression is affected by the magnitude of <math>\alpha^{\boldsymbol{W}}(\boldsymbol{x})</math>, whereas the second expression is not due to the <math>\frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}</math> ratio.<br />
<br />
== Experiments ==<br />
<br />
Throughout the experiments in this paper, the authors employ various models based on residual connections (He et al., 2016 [1]) which are the models used for benchmarking in practice. We will first demonstrate improvements provided by BM, then we will show versatility in other applications. For fairness of comparisons, all configurations in the reference implementation will be fixed. The only additions in the experiments are initial learning rate warm-up and gradient clipping which are extremely helpful for stable training of BM. <br />
<br />
=== Generalization performance === <br />
The paper compares the generalization performance of BM with softmax and MC dropout on CIFAR-10 and CIFAR-100 benchmarks.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T1.png]]<br />
<br />
The next comparison was performed between BM and softmax on the ImageNet benchmark. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T2.png]]<br />
<br />
For both datasets and In all configurations, BM achieves the best generalization and outperforms softmax and MC dropout.<br />
<br />
===== Regularization effect of prior =====<br />
<br />
In theory, BM has 2 regularization effects:<br />
The prior distribution, which smooths the target posterior<br />
Averaging all of the possible categorical probabilities to compute the distribution matching loss<br />
The authors perform an ablation study to examine the 2 effects separately - removing the KL term in the ELBO removes the effect of the prior distribution.<br />
For ResNet-50 on CIFAR-100 and CIFAR-10 the resulting test error rates were 24.69% and 5.68% respectively. <br />
<br />
This demonstrates that both regularization effects are significant since just having one of them improves the generalization performance compared to the softmax baseline, and having both improves the performance even more.<br />
<br />
===== Impact of <math>\beta</math> =====<br />
<br />
The effect of β on generalization performance is studied by training ResNet-18 on CIFAR-10 by tuning the value of β on its own, as well as jointly with λ. It was found that robust generalization performance is obtained for β ∈ [<math>e^{−1}, e^4</math>] when tuning β on its own; and β ∈ [<math>e^{−4}, e^{8}</math>] when tuning β jointly with λ. The figure below shows a plot of the error rate with varying β.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F3.png]]<br />
<br />
=== Uncertainty Representation ===<br />
<br />
One of the big advantages of BM is the ability to represent uncertainty about the prediction. The authors evaluate the uncertainty representation on in-distribution (ID) and out-of-distribution (OOD) samples. <br />
<br />
===== ID uncertainty =====<br />
<br />
For ID (in-distribution) samples, calibration performance is measured, which is a measure of how well the model’s confidence matches its actual accuracy. This measure can be visualized using reliability plots and quantified using a metric called expected calibration error (ECE). ECE is calculated by grouping predictions into M groups based on their confidence score and then finding the absolute difference between the average accuracy and average confidence for each group. We can define the ECE of <math>f^W </math> on <math>D </math> with <math>M</math> groups as <br />
<br />
<center><br />
<math>ECE_M(f^W, D) = \sum^M_{i=1} \frac{|G_i|}{|D|}|acc(G_i) - conf(G_i)|</math><br />
</center><br />
Where <math>G_i</math> is a set of samples int the i-th group defined as <math>G_i = \{j:i/M < max_k\phi_k(f^Wx^{(j)}) \leq (1+i)/M\}</math>, <math>acc(G_i)</math> is an average accuracy in the i-th group and <math>conf(G_i)</math> is an average confidence in the i-th group.<br />
<br />
The figure below is a reliability plot of ResNet-50 on CIFAR-10 and CIFAR-100 with 15 groups. It shows that BM has a significantly better calibration performance than softmax since the confidence matches the accuracy more closely (this is also reflected in the lower ECE).<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F4.png]]<br />
<br />
===== OOD uncertainty =====<br />
<br />
Here, the authors quantify uncertainty using predictive entropy - the larger the predictive entropy, the larger the uncertainty about a prediction. <br />
<br />
The figure below is a density plot of the predictive entropy of ResNet-50 on CIFAR-10. It shows that BM provides significantly better uncertainty estimation compared to other methods since BM is the only method that has a clear peak of high predictive entropy for OOD samples which should have high uncertainty. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F5.png]]<br />
<br />
=== Transfer learning ===<br />
<br />
Belief matching applies the Bayesian principle outside the neural network, which means it can easily be applied to already trained models. Thus, belief matching can be employed in transfer learning scenarios. The authors downloaded the ImageNet pre-trained ResNet-50 weights and fine-tuned the weights of the last linear layer for 100 epochs using an Adam optimizer.<br />
<br />
This table shows the test error rates from transfer learning on CIFAR-10, Food-101, and Cars datasets. Belief matching consistently performs better than softmax. <br />
<br />
[[File:being_bayesian_about_categorical_probability_transfer_learning.png]]<br />
<br />
Belief matching was also tested for the predictive uncertainty for out of dataset samples based on CIFAR-10 as the in distribution sample. Looking at the figure below, it is observed that belief matching significantly improves the uncertainty representation of pre-trained models by only fine-tuning the last layer’s weights. Note that belief matching confidently predicts examples in Cars since CIFAR-10 contains the object category automobiles. In comparison, softmax produces confident predictions on all datasets. Thus, belief matching could also be used to enhance the uncertainty representation ability of pre-trained models without sacrificing their generalization performance.<br />
<br />
[[File: being_bayesian_about_categorical_probability_transfer_learning_uncertainty.png]]<br />
<br />
=== Semi-Supervised Learning ===<br />
<br />
Belief matching’s ability to allow neural networks to represent rich information in their predictions can be exploited to aid consistency based loss function for semi-supervised learning. Consistency-based loss functions use unlabelled samples to determine where to promote the robustness of predictions based on stochastic perturbations. This can be done by perturbing the inputs (which is the VAT model) or the networks (which is the pi-model). Both methods minimize the divergence between two categorical probabilities under some perturbations, thus belief matching can be used by the following replacements in the loss functions. The hope is that belief matching can provide better prediction consistencies using its Dirichlet distributions.<br />
<br />
[[File: being_bayesian_about_categorical_probability_semi_supervised_equation.png]]<br />
<br />
The results of training on ResNet28-2 with consistency based loss functions on CIFAR-10 are shown in this table. Belief matching does have lower classification error rates compared to using a softmax.<br />
<br />
[[File:being_bayesian_about_categorical_probability_semi_supervised_table.png]]<br />
<br />
== Conclusion and Critiques ==<br />
<br />
* Bayesian principles can be used to construct the target distribution by using the categorical probability as a random variable rather than a training label. This can be applied to neural network models by replacing only the softmax and cross-entropy loss while improving the generalization performance, uncertainty estimation and well-calibrated behavior. <br />
<br />
* In the future, the authors would like to allow for more expressive distributions in the belief matching framework, such as logistic normal distributions to capture strong semantic similarities among class labels. Furthermore, using input dependent priors would allow for interesting properties that would aid imbalanced datasets and multi-domain learning.<br />
<br />
* Overall I think this summary is very good. The Method(Algorithm) section is described clearly, and the Results section is detailed, with many diagrams illustrating the main points. I just have one technical suggestion: the difference in performance for SOFTMAX and BM differs by model. For example, for RESNEXT-50 model, the difference in top1 is 0.2, whereas, for the RESNEXT-100 model, the difference in the top one is 0.5, which is significantly higher. It's true that BM method generally outperforms SOFTMAX. But seeing the relation between the choice of model and the magnitude of a performance increase could definitely strengthen the paper even further.<br />
<br />
* The summary is good and the topic is interesting. Bayesian is a well know probabilistic model but did not know that it can be used as a neural network. Comparison between softmax and bayesian was interesting and more details would be great.<br />
<br />
* It would be better if there is a future work section to discuss the current shortage and potential improvement. One thing would be that the theoretical part is complex in the process. In addition, optimizing a function is relatively hard if the structure is complex. Is it possible to have a good approximation without having too complex a calculation?<br />
<br />
* Both experiments dealt with image data, however, softmax is used within classification neural networks that range from image to textual data. It would be interesting to see the performance of BM on textual data for text classification problems in addition to image classification.<br />
<br />
* It would be better to briefly explain Bayesian treatment in the introduction part(i.e., considering the categorical probability as a random variable, construct the target distribution by means of the Bayesian inference), and to analyze the importance of considering the categorical probability as a random variable (for example explain it can be adapted to existing deep learning building blocks without huge modifications).<br />
<br />
* Interesting topic that goes close to our lectures. Since this is a summary of the paper, it would be better if trim the explanation on Neural Network a little like getting rid of the substitution lines.<br />
<br />
* I really liked the presentation and actually really appreciate the steps of the detailed derivation that were presented in this summary. In the introduction the researchers mentioned that BM is a computationally cheap method, however, I was wondering how much faster it is computationally as opposed to the other models to train. Additionally, the training data that was used to benchmark the classification performance seemed to all be image classifications (CIFAR-10, CIFAR-100, ResNet-50, ResNet-101), thus it would have been nice to see classification be applied in other multi-class contexts as well to see how well this new method performs there.<br />
<br />
== Citations ==<br />
<br />
[1] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Springer, 1990.<br />
<br />
[2] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.<br />
<br />
[3] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.<br />
<br />
[4] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. <br />
<br />
[5] MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448– 472, 1992.<br />
<br />
[6] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011. <br />
<br />
[7] Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1):4873–4907, 2017.<br />
<br />
[8] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference of Machine Learning, 2018.<br />
<br />
[9] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[10] Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[11] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.<br />
<br />
[12] Neumann, L., Zisserman, A., and Vedaldi, A. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2018.<br />
<br />
[13] Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. Disturblabel: Regularizing cnn on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.<br />
<br />
[14] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Evaluating_Machine_Accuracy_on_ImageNet&diff=49701Evaluating Machine Accuracy on ImageNet2020-12-07T03:40:12Z<p>X277lu: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du<br />
<br />
== Introduction == <br />
ImageNet is one of the most influential dataset in machine learning with images and corresponding labels over 1000 classes. This paper intends to explore the causes for performance differences between human experts and machine learning models, more specifically, CNN, on ImageNet. <br />
<br />
Firstly, some images could belong to multiple classes. As a result, it is possible to underestimate the performance if we assign each image with only one label, which is what is being done in the top-1 metric. On the other hand, the top-5 metric looks at the top five predictions by the model for an image and checks if the target label is within those five predictions (Krizhevsky, Sutskever, & Hinton). Therefore, we adopt both top-1 and top-5 metrics where the performances of models, unlike human labelers, are linearly correlated in both cases.<br />
<br />
Secondly, in contrast to the uniform performance of models in classes, humans tend to achieve better performances on inanimate objects. Human labelers achieve similar overall accuracies as the models, which indicates spaces of improvements on specific classes for machines.<br />
<br />
Lastly, the setup of drawing training and test sets from the same distribution may favor models over human labelers. That is, the accuracy of multi-class prediction from models drops when the testing set is drawn from a different distribution than the training set, ImageNetV2. But this shift in distribution does not cause a problem for human labelers.<br />
<br />
== Experiment Setup ==<br />
=== Overview ===<br />
There are four main phases to the experiment, which are (i) initial multilabel annotation, (ii) human labeler training, (iii) human labeler evaluation, and (iv) final annotation overview. The five authors of the paper are the participants in the experiments. <br />
<br />
A brief overview of the four phases is as follows:<br />
[[File:Experiment Set Up.png |800px| center]]<br />
<br />
=== Initial multi-label annotation ===<br />
Three labelers A, B, and C provided multi-label annotations for a subset from the ImageNet validation set, and all images from the ImageNetV2 test sets. These experiences give A, B, and C extensive experience with the ImageNet dataset. <br />
<br />
=== Human Labeler Training === <br />
All five labelers trained on labeling a subset of the remaining ImageNet images. "Training" the human labelers consisted of teaching the humans the distinctions between very similar classes in the training set. For example, there are 118 classes of "dog" within ImageNet and typical human participants will not have working knowledge of the names of each breed of dog seen even if they can recognize and distinguish that breed from others. Local members of the American Kennel Club were even contacted to help with dog breed classification. To do this labelers were trained on class-specific tasks for groups like dogs, insects, monkeys beaver and others. They were also given immediate feedback on whether they were correct and then were asked where they thought they needed more training to improve. Unlike the two annotators in (Russakovsky et al., 2015), who had insufficient training data, the labelers in this experiment had up to 100 training images per class while labeling. This allowed the labelers to really understand the finer details of each class.<br />
<br />
=== Human Labeler Evaluation ===<br />
Class-balanced random samples, which contain 1,000 images from the 20,000 annotated images are generated from both the ImageNet validation set and ImageNetV2. Five participants labeled these images over 28 days.<br />
<br />
=== Final annotation Review ===<br />
All labelers reviewed the additional annotations generated in the human labeler evaluation phase.<br />
<br />
== Multi-label annotations==<br />
[[File:Categories Multilabel.png|800px|center]]<br />
<div align="center">Figure 3</div><br />
<br />
===Top-1 accuracy===<br />
With Top-1 accuracy being the standard accuracy measure used in classification studies, it measures the proportions of examples for which the predicted label matches the single target label. As many images often contain more than one object for classification, for example, Figure 3a contains a desk, laptop, keyboard, space bar, and more. With Figure 3b showing a centered prominent figure yet labeled otherwise (people vs picket fence), it can be seen how a single target label is inaccurate for such a task since identifying the main objects in the image does not suffice due to its overly stringent and punishes predictions that are the main image yet does not match its label.<br />
===Top-5 accuracy===<br />
With Top-5 considers a classification correct if the object label is in the top 5 predicted labels. Although it partially resolves the problem with Top-1 labeling, it is still not ideal since it can trivialize class distinctions. For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations.<br />
<br />
===Multi-label accuracy===<br />
The paper then proposes that for every image, the image shall have a set of target labels and a prediction; if such prediction matches one of the labels, it will be considered as correct labeling. Due to the above-discussed limitations of Top-1 and Top-5 metrics, the paper claims it is necessary for rigorous accuracy evaluation on the dataset. <br />
<br />
===Types of Multi-label annotations===<br />
====Multiple objects or organisms====<br />
For the images containing more than one object or organism that corresponds to ImageNet, the paper proposed to add an additional target label for each entity in the image. With the discussed image in Figure 3b, the class groom, bow tie, suit, gown, and hoopskirt are all present in the foreground which is then subsequently added to the set of labels.<br />
====Synonym or subset relations====<br />
For similar classes, the paper considers them as under the same bigger class, that is, for two similarly labeled images, classification is considered correct if the produced label matches either one of the labels. For instance, warthog, African elephant, and Indian element all have prominent tusks, they will be considered subclasses of the tusker, Figure 3c shows a modification of labels to contain tusker as a correct label.<br />
====Unclear Image====<br />
In certain cases such as Figure 3d, there is a distinctive difficulty to determine whether a label was correct due to ambiguities in the class hierarchy.<br />
===Collecting multi-label annotations===<br />
Participants reviewed all predictions made by the models on the dataset ImageNet and ImageNet-V2, the participants then categorized every unique prediction made by the models on the dataset into correct and incorrect labels in order to allow all images to have multiple correct labels to satisfy the above-listed method.<br />
===The multi-label accuracy metric===<br />
One prediction is only correct if and only if it was marked correct by the expert reviewers during the annotation stage. As discussed in the experiment setup section, after human labelers have completed labeling, a second annotation stage is conducted. In Figure 4, a comparison of Top-1, Top-5, and multi-label accuracies showed higher Top-1 and Top-5 accuracy corresponds with higher multi-label accuracy as expected. With multi-label accuracies measures consistently higher than Top-1 yet lower than Top-5 which shows a high correlation between the three metrics, the paper concludes that multi-label metrics measures a semantically more meaningful notion of accuracy compared to its counterparts.<br />
<br />
== Human Accuracy Measurement Process ==<br />
=== Bias Control ===<br />
Since three participants participated in the initial round of annotation, they did not look at the data for six months, and two additional annotators are introduced in the final evaluation phase to ensure fairness of the experiment. <br />
<br />
=== Human Labeler Training ===<br />
The three main difficulties encountered during human labeler training are fine-grained distinctions, class unawareness, and insufficient training images. Thus, three training regimens are provided to address the problems listed above, respectively. First, labelers will be assigned extra training tasks with immediate feedbacks on similar classes. Second, labelers will be provided access to search for specific classes during labeling. Finally, the training set will contain a reasonable amount of images for each class.<br />
<br />
=== Labeling Guide ===<br />
A labeling guide is constructed to distill class analysis learned during training into discriminative traits that could be used as a reference during the final labeling evaluation.<br />
<br />
=== Final Evaluation and Review ===<br />
Two samples, each containing 1000 images, are sampled from ImageNet and ImageNetV2, respectively, They are sampled in a class-balanced manner and shuffled together. Over 28 days, all five participants labeled all images. They spent a median of 26 seconds per image. After labeling is completed, an additional multi-label annotation session was conducted, in which human predictions for all images are manually reviewed. Comparing to the initial round of labeling, 37% of the labels changes due to participants' greater familiarity with the classes.<br />
<br />
== Main Results ==<br />
[[File:Evaluating Machine Accuracy on ImageNet Figure 1.png | center]]<br />
<br />
<div align="center">Figure 1</div><br />
<br />
===Comparison of Human and Machine Accuracies on Image Net===<br />
From Figure 1, we can see that the difference in accuracies between the datasets is within 1% for all human participants. As hypothesized, human testers indeed performed better than the automated models on both datasets. It's worth noticing that labelers D and E, who did not participate in the initial annotation period, actually performed better than the best automated model.<br />
===Comparison of Human and Machine Accuracies on Image Net===<br />
Based on the results shown in Figure 1, we can see that the confidence interval of the best 4 human participants and 4 best model overlap; however, with a p-value of 0.037 using the McNemar's paired test, it rejects the hypothesis that the FixResNeXt model and Human E labeler have the same accuracy with respect to the ImageNet validation dataset. Figure 1 also shows that the confidence intervals of the labeling accuracies for human labelers C, D, E do not overlap with the confidence interval of the best model with respect to ImageNet-V2 and with the McNemar's test yielding a p-value of <math>2\times 10^{-4}</math>, it is clear that the hypothesis human and machined models have same robustness to model distribution shifts ought to be rejected.<br />
<br />
== Other Observations ==<br />
<br />
[[File: Results_Summary_Table.png| 800px|center]]<br />
<br />
=== Difficult Images ===<br />
<br />
The experiment also shed some light on images that are difficult to label. 10 images were misclassified by all of the human labelers. Among those 10 images, there was 1 image of a monkey and 9 of dogs. In addition, 27 images, with 19 in object classes and 8 in organism classes, were misclassified by all 72 machine learning models in this experiment. Only 2 images were labeled wrong by all human labelers and models. Both images contained dogs. Researchers also noted that difficult images for models are mostly images of objects and exclusively images of animals for human labelers.<br />
<br />
=== Accuracies without dogs ===<br />
<br />
As previously discussed in the paper, machine learning models tend to outperform human labelers when classifying the 118 dog classes. To better understand to what extent does models outperform human labelers, researchers computed the accuracies again by excluding all the dog classes. Results showed a 0.6% increase in accuracy on the ImageNet images using the best model and a 1.1% increase on the ImageNet V2 images. In comparison, the mean increases in accuracy for human labelers are 1.9% and 1.8% on the ImageNet and ImageNet V2 images respectively. Researchers also conducted a simulation to demonstrate that the increase in human labeling accuracy on non-dog images is significant. This simulation was done by bootstrapping to estimate the changes in accuracy when only using data for the non-dog classes, and simulation results show smaller increases than in the experiment. <br />
<br />
In conclusion, it's more difficult for human labelers to classify images with dogs than it is for machine learning models.<br />
<br />
=== Accuracies on objects ===<br />
Researchers also computed machine and human labelers' accuracies on a subset of data with only objects, as opposed to organisms, to better illustrate the differences in performance. This test involved 590 object classes. As shown in the table above, there is a 3.3% and 3.4% increase in mean accuracies for human labelers on the ImageNet and ImageNet V2 images. In contrast, there is a 0.5% decrease in accuracy for the best model on both ImageNet and ImageNet V2. This indicates that human labelers are much better at classifying objects than these models are.<br />
<br />
=== Accuracies on fast images ===<br />
Unlike the CNN models, human labelers spent different amounts of time on different images, spanning from several seconds to 40 minutes. To further analyze the images that take human labelers less time to classify, researchers took a subset of images with median labeling time spent by human labelers of at most 60 seconds. These images were referred to as "fast images". There are 756 and 714 fast images from ImageNet and ImageNet V2 respectively, out of the total 2000 images used for evaluation. Accuracies of models and humans on the fast images increased significantly, especially for humans. <br />
<br />
This result suggests that human labelers know when an image is difficult to label and would spend more time on it. It also shows that the models are more likely to correctly label images that human labelers can label relatively quickly.<br />
<br />
== Related Work ==<br />
<br />
=== Human accuracy on ImageNet ===<br />
<br />
Russakovsky et al. (2015) studied two trained human labelers' accuracies on 1500 and 258 images in the context of the ImageNet challenge. The top-5 accuracy of the labeler who labeled 1500 images was the well-known human baseline on ImageNet. <br />
<br />
As introduced before, the researchers went beyond by using multi-label accuracy, using more labelers, and focusing on robustness to small distribution shifts. Although the researchers had some different findings, some results are also consistent with results from (Russakovsky et al., 2015). An example is that both experiments indicated that it takes human labelers around one minute to label an image. The time distribution also has a long tail, due to the difficult images as mentioned before.<br />
<br />
=== Human performance in computer vision broadly ===<br />
There are many examples of recent studies about humans in the area of computer vision, such as investigating human robustness to synthetic distribution change (Geirhos et al., 2017) and studying what characteristics do humans use to recognize objects (Geirhos et al., 2018). Other examples include the adversarial examples constructed to fool both machines and time-limited humans (Elsayed et al., 2018) and illustrating foreground/background objects' effects on human and machine performance (Zhu et al., 2016). <br />
<br />
=== Multi-label annotations ===<br />
Stock & Cissé (2017) also studied ImageNet's multi-label nature, which aligns with the researchers' study in this paper. According to Stock & Cissé (2017), the top-1 accuracy measure could underestimate multi-label by up to 13.2%. The authors suggest that releasing these labelled data to the public will allow for more robust models in the future.<br />
<br />
=== ImageNet inconsistencies and label error ===<br />
Researches have found and recorded some incorrectly labeled images from ImageNet and ImageNet V2 during this study. Earlier studies (Van Horn et al., 2015) also shown that at least 4% of the birds in ImageNet are misclassified. This work also noted that the inconsistent taxonomic structure in birds' classes could lead to weak class boundaries. Researchers also noted that the majority of the fine-grained organism classes also had similar taxonomic issues.<br />
<br />
=== Distribution shift ===<br />
There has been an increasing amount of studies in this area. One focus of the studies is distributionally robust optimization (DRO), which finds the model that has the smallest worst-case expected error over a set of probability distributions. Another focus is on finding the model with the lowest error rates on adversarial examples. Work in both areas has been productive, but none was shown to resolve the drop in accuracies between ImageNet and ImageNet V2. A recent [https://papers.nips.cc/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf paper] also discusses quantifying uncertainty under a distribution shift, in other words whether the output of probabilistic deep learning models should or should not be trusted.<br />
<br />
== Conclusion and Future Work ==<br />
<br />
=== Conclusion ===<br />
Researchers noted that in order to achieve truly reliable machine learning, they need a deeper understanding of the range of parameters where the model still remain robust. Techniques from Combinatorics and sensitivity analysis, in particular, might yield fruitful results. This study has provided valuable insights into the desired robustness properties by comparing model performance to human performance. This is especially evident given the results of the experiment which show humans drastically outperforming machine learning in many cases and proposes the question of how much accuracy one is willing to give up in exchange for efficiency. The results have shown that current performance benchmarks are not addressing the robustness to small and natural distribution shifts, which are easily handled by humans.<br />
<br />
=== Future work ===<br />
Other than improving the robustness of models, researchers should consider investigating if less-trained human labelers can achieve a similar level of robustness to distributional shifts. In addition, researchers can study the robustness to temporal changes, which is another form of natural distribution shift (Gu et al., 2019; Shankar et al., 2019). Also, Convolutional Neural Network can be a candidate to improve the accuracy of classifying images.<br />
<br />
== Critiques ==<br />
# The method of using human to classify Imagenet is fully circular, since the label of imagenet itself is originally annotated by human beings. In fact, the classification scheme itself is intrinsically human construction. It is not logical to test human performance with human performance. This circular contsruction completely violates scientific principles.<br />
# Table 1 simply showed a difference in ImageNet multi-label accuracy yet does not give an explicit reason as to why such a difference is present. Although the paper suggested the distribution shift has caused the difference, it does not give other factors to concretely explain why the distribution shift was the cause.<br />
# With the recommendation to future machine evaluations, the paper proposed to "Report performances on dogs, other animals, and inanimate objects separately.". Despite its intentions, it is narrowly specific and requires further generalization for it to be convincing. <br />
# With choosing human subjects as samplers, no further information was given as to how they are chosen nor there are any background information was given. As it is a classification problem involving many classes as specific to species, a biology student would give far more accurate results than a computer science student or a math student. To make this study more plausible, more human labelers should be sampled.<br />
# As explaining the importance of multi-label metrics using comparison to Top-5 metric, the turtle example falls within the overall similarity (simony) classification of the multi-label evaluation metric, as such, if the Top-5 evaluation suggests any one of the turtle species were selected, the algorithm is considered to produce a correct prediction which is the intention. The example does not convey the necessity of changing to the proposed metric over the Top-5 metric. <br />
# With the definition in the paper regarding multi-label metrics, it is hard to see why expanding the label set is different from a traditional Top-5 metric or rather necessary, ergo does not yield the claim which the proposed metric is necessary for rigorous accuracy evaluation on ImageNet.<br />
# When discussing the main results, the paper discusses the hypothesis on distribution shift having no effects on human and machine model accuracies; the presentation is poor at best with no clear centric to what they are trying to convey to how (in detail) they resulted in such claims.<br />
# In the experiment setup of the presentation, there are a lot of key terms without detailed description. For example, Human labeler training using a subset of the remaining 30,000 unannotated images in the ImageNet validation set, labelers A, B, C, D, and E underwent extensive training to understand the intricacies of fine-grained class distinctions in the ImageNet class hierarchy. Authors should clarify each key term in the presentation otherwise readers are hard to follow.<br />
# Not sure how the human samplers were determined and simply picking several people will have really high bias because the sample is too small and they have different background which will definitely affect the results a lot. Also, it will be better if there are more comparisons between the model introduced and other models.<br />
# Given the low amount of human participants, it is hard to take the results seriously (there is too much variance). Also it's not exactly clear how the authors determined that the multi-label accuracy metric measures a semantically more meaningful notion of accuracy compared to its counterparts. For example, one of the issues with top-5 accuracy that they mention is: "For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations." But it's not clear how multi-label accuracy would be better in this instance.<br />
# It is unclear how well the human labeler can perform labeling after training. So the final result is not that trust-worthy.<br />
# In this experiment set up, label annotators are the same as participants of the experiments. Even if there's a break between the annotating and evaluating human labeler evaluation, the impact of the break in reducing bias is not clear. One potential human labeling data is google's "I'm not a robot" verification test. One variation of the verification test asks users to select all the photos from 9 images that are related to a certain keyword. This allows for a more accurate measurement of human performance vs ImageNet performance. In addition, it's going to reduce the biases from the small number of experiment participants.<br />
# Following Table 2, the authors appear to try and claim that the model is better than the human labelers, simply because the model experienced a better increase in classification following the removal of dog photos then the human labeler did, however, a quick look at the table shows that most human labelers still performed better than the best model. The authors should be making the claim that human labelers are better at labeling dogs than the modal, but are still better overall after removing the dogs dataset.<br />
# The reason why human labeler outperforms CNN could be human had much more training. It would be more convincing if the paper could provide a metric in order to measure human labelers' training data set size.<br />
# Actually, in the multi-label case, it is vague to determine whether the machine learning model or the human labellers were giving the correct label. The structure of the dataset is pretty essential in training a network, in which data with uncertain label (even determined by human) should be avoided.<br />
# The authors mentioned that untrained labelers will likely be in lower accuracy, they can give a standard or definition about a well-trained labeler.<br />
# I believe the authors needed to include more information about how they determined the samples such as human samplers, and also more details on how to define unclear images.<br />
# It would be more convincing if the author could provide the criteria of being human samplers and unclear images, and the accuracy of the human labeler.<br />
# The summary only explains some model components but does not thoroughly goes through the big picture of the model; data-preprocessing, training, and prediction procedures. It would be nice to know the details as well.<br />
# It seems the core problem is more about the dataset itself and not the evaluation procedure. We would not have issues with top 1 and top 5 if Imagenet contained discernable classes with good labels. Of course, this is very expensive, and imagenet is an _excellent_ dataset given these constraints. It does not seem like their proposed solution, multiple labels per image, addresses their concerns properly, as other critiques have already mentioned. Furthermore, having multiple labels per image does not translate to real-life value the same way that the top 5 or top 1 metric does, as in the common case, there is one right answer for a classification problem.<br />
# The paper could provide details on ways to improve the accuracy and robustness of the model. Since the paper mentions CNN, it could provide details of the model and why CNN is a good candidate.<br />
# The accuracy of the model is directly correlated with how the images are labelled. In all multi-label annotations, the authors describe a predicted label as correct if it is within a set of "correct labels" where each image has a different number of correct labels. Perhaps it would yield better results if the model were to first identify the number of objects in the image first and then by using some form of criteria, it labels those identified objects in order of importance (i.e. objects that are closer are labelled first). The authors also never specified what criteria the model uses to "pick out" which object it will label in the image.<br />
# The paper mentions difficult images and fast images. It would be better if the paper had generalized the type of images that constitute difficult images (i.e., the paper mentions 118 dog classes, what are some general characteristics of difficult images?) In addition, it would be interesting to compare the performance between human and machine accuracy on non-fast images.<br />
# The paper meaingfully and correctly points out the problem that the current evaluation of ML algorithm by only using the accuracy on Imagnet as the bechmark is simplistic and probelmatic. However, the idea of comparing human performance with ML models is problematic, since it is hard or even impossible to control the various variables that can drastically change human performance: training time, domain knowledge, cognitive function, amount of work-load, and various environmental factors. In order to compare different experimental methods, the most important step is to carefully control the confounding variables to reach any meaningful conclusion.<br />
# The original question would be how ImageNet was created? The images can only be labeled by human labelers. There might be some mistakes and missing details. The combination of human evaluation and machine evaluation might be helpful to resolve this problem.<br />
# Regarding the difficult problems, CatsandDogs is a very popular dataset that in this day it is trivial for a non-deep CNN to classify with high accuracies. The author would do well to test this CNN with other classes, since its inconceivable that a CNN cannot learn the characteristics of a dog with relation to the other classes in Imagenet. Why would it be able to differentiate the breeds of dogs, but not other organisms?<br />
<br />
== Reference ==<br />
[1] Shankar, V., Roelofs, R., Mania, H., Fang, A., Recht, B., & Schmidt, L. (2020). Evaluating Machine Accuracy on ImageNet. ICML 2020.<br />
<br />
[2] Krizhevsky, A., Sutskever, I., & Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. 2. Retrieved from http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Evaluating_Machine_Accuracy_on_ImageNet&diff=49700Evaluating Machine Accuracy on ImageNet2020-12-07T03:38:13Z<p>X277lu: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du<br />
<br />
== Introduction == <br />
ImageNet is one of the most influential dataset in machine learning with images and corresponding labels over 1000 classes. This paper intends to explore the causes for performance differences between human experts and machine learning models, more specifically, CNN, on ImageNet. <br />
<br />
Firstly, some images could belong to multiple classes. As a result, it is possible to underestimate the performance if we assign each image with only one label, which is what is being done in the top-1 metric. On the other hand, the top-5 metric looks at the top five predictions by the model for an image and checks if the target label is within those five predictions (Krizhevsky, Sutskever, & Hinton). Therefore, we adopt both top-1 and top-5 metrics where the performances of models, unlike human labelers, are linearly correlated in both cases.<br />
<br />
Secondly, in contrast to the uniform performance of models in classes, humans tend to achieve better performances on inanimate objects. Human labelers achieve similar overall accuracies as the models, which indicates spaces of improvements on specific classes for machines.<br />
<br />
Lastly, the setup of drawing training and test sets from the same distribution may favor models over human labelers. That is, the accuracy of multi-class prediction from models drops when the testing set is drawn from a different distribution than the training set, ImageNetV2. But this shift in distribution does not cause a problem for human labelers.<br />
<br />
== Experiment Setup ==<br />
=== Overview ===<br />
There are four main phases to the experiment, which are (i) initial multilabel annotation, (ii) human labeler training, (iii) human labeler evaluation, and (iv) final annotation overview. The five authors of the paper are the participants in the experiments. <br />
<br />
A brief overview of the four phases is as follows:<br />
[[File:Experiment Set Up.png |800px| center]]<br />
<br />
=== Initial multi-label annotation ===<br />
Three labelers A, B, and C provided multi-label annotations for a subset from the ImageNet validation set, and all images from the ImageNetV2 test sets. These experiences give A, B, and C extensive experience with the ImageNet dataset. <br />
<br />
=== Human Labeler Training === <br />
All five labelers trained on labeling a subset of the remaining ImageNet images. "Training" the human labelers consisted of teaching the humans the distinctions between very similar classes in the training set. For example, there are 118 classes of "dog" within ImageNet and typical human participants will not have working knowledge of the names of each breed of dog seen even if they can recognize and distinguish that breed from others. Local members of the American Kennel Club were even contacted to help with dog breed classification. To do this labelers were trained on class-specific tasks for groups like dogs, insects, monkeys beaver and others. They were also given immediate feedback on whether they were correct and then were asked where they thought they needed more training to improve. Unlike the two annotators in (Russakovsky et al., 2015), who had insufficient training data, the labelers in this experiment had up to 100 training images per class while labeling. This allowed the labelers to really understand the finer details of each class.<br />
<br />
=== Human Labeler Evaluation ===<br />
Class-balanced random samples, which contain 1,000 images from the 20,000 annotated images are generated from both the ImageNet validation set and ImageNetV2. Five participants labeled these images over 28 days.<br />
<br />
=== Final annotation Review ===<br />
All labelers reviewed the additional annotations generated in the human labeler evaluation phase.<br />
<br />
== Multi-label annotations==<br />
[[File:Categories Multilabel.png|800px|center]]<br />
<div align="center">Figure 3</div><br />
<br />
===Top-1 accuracy===<br />
With Top-1 accuracy being the standard accuracy measure used in classification studies, it measures the proportions of examples for which the predicted label matches the single target label. As many images often contain more than one object for classification, for example, Figure 3a contains a desk, laptop, keyboard, space bar, and more. With Figure 3b showing a centered prominent figure yet labeled otherwise (people vs picket fence), it can be seen how a single target label is inaccurate for such a task since identifying the main objects in the image does not suffice due to its overly stringent and punishes predictions that are the main image yet does not match its label.<br />
===Top-5 accuracy===<br />
With Top-5 considers a classification correct if the object label is in the top 5 predicted labels. Although it partially resolves the problem with Top-1 labeling, it is still not ideal since it can trivialize class distinctions. For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations.<br />
<br />
===Multi-label accuracy===<br />
The paper then proposes that for every image, the image shall have a set of target labels and a prediction; if such prediction matches one of the labels, it will be considered as correct labeling. Due to the above-discussed limitations of Top-1 and Top-5 metrics, the paper claims it is necessary for rigorous accuracy evaluation on the dataset. <br />
<br />
===Types of Multi-label annotations===<br />
====Multiple objects or organisms====<br />
For the images containing more than one object or organism that corresponds to ImageNet, the paper proposed to add an additional target label for each entity in the image. With the discussed image in Figure 3b, the class groom, bow tie, suit, gown, and hoopskirt are all present in the foreground which is then subsequently added to the set of labels.<br />
====Synonym or subset relations====<br />
For similar classes, the paper considers them as under the same bigger class, that is, for two similarly labeled images, classification is considered correct if the produced label matches either one of the labels. For instance, warthog, African elephant, and Indian element all have prominent tusks, they will be considered subclasses of the tusker, Figure 3c shows a modification of labels to contain tusker as a correct label.<br />
====Unclear Image====<br />
In certain cases such as Figure 3d, there is a distinctive difficulty to determine whether a label was correct due to ambiguities in the class hierarchy.<br />
===Collecting multi-label annotations===<br />
Participants reviewed all predictions made by the models on the dataset ImageNet and ImageNet-V2, the participants then categorized every unique prediction made by the models on the dataset into correct and incorrect labels in order to allow all images to have multiple correct labels to satisfy the above-listed method.<br />
===The multi-label accuracy metric===<br />
One prediction is only correct if and only if it was marked correct by the expert reviewers during the annotation stage. As discussed in the experiment setup section, after human labelers have completed labeling, a second annotation stage is conducted. In Figure 4, a comparison of Top-1, Top-5, and multi-label accuracies showed higher Top-1 and Top-5 accuracy corresponds with higher multi-label accuracy as expected. With multi-label accuracies measures consistently higher than Top-1 yet lower than Top-5 which shows a high correlation between the three metrics, the paper concludes that multi-label metrics measures a semantically more meaningful notion of accuracy compared to its counterparts.<br />
<br />
== Human Accuracy Measurement Process ==<br />
=== Bias Control ===<br />
Since three participants participated in the initial round of annotation, they did not look at the data for six months, and two additional annotators are introduced in the final evaluation phase to ensure fairness of the experiment. <br />
<br />
=== Human Labeler Training ===<br />
The three main difficulties encountered during human labeler training are fine-grained distinctions, class unawareness, and insufficient training images. Thus, three training regimens are provided to address the problems listed above, respectively. First, labelers will be assigned extra training tasks with immediate feedbacks on similar classes. Second, labelers will be provided access to search for specific classes during labeling. Finally, the training set will contain a reasonable amount of images for each class.<br />
<br />
=== Labeling Guide ===<br />
A labeling guide is constructed to distill class analysis learned during training into discriminative traits that could be used as a reference during the final labeling evaluation.<br />
<br />
=== Final Evaluation and Review ===<br />
Two samples, each containing 1000 images, are sampled from ImageNet and ImageNetV2, respectively, They are sampled in a class-balanced manner and shuffled together. Over 28 days, all five participants labeled all images. They spent a median of 26 seconds per image. After labeling is completed, an additional multi-label annotation session was conducted, in which human predictions for all images are manually reviewed. Comparing to the initial round of labeling, 37% of the labels changes due to participants' greater familiarity with the classes.<br />
<br />
== Main Results ==<br />
[[File:Evaluating Machine Accuracy on ImageNet Figure 1.png | center]]<br />
<br />
<div align="center">Figure 1</div><br />
<br />
===Comparison of Human and Machine Accuracies on Image Net===<br />
From Figure 1, we can see that the difference in accuracies between the datasets is within 1% for all human participants. As hypothesized, human testers indeed performed better than the automated models on both datasets. It's worth noticing that labelers D and E, who did not participate in the initial annotation period, actually performed better than the best automated model.<br />
===Comparison of Human and Machine Accuracies on Image Net===<br />
Based on the results shown in Figure 1, we can see that the confidence interval of the best 4 human participants and 4 best model overlap; however, with a p-value of 0.037 using the McNemar's paired test, it rejects the hypothesis that the FixResNeXt model and Human E labeler have the same accuracy with respect to the ImageNet validation dataset. Figure 1 also shows that the confidence intervals of the labeling accuracies for human labelers C, D, E do not overlap with the confidence interval of the best model with respect to ImageNet-V2 and with the McNemar's test yielding a p-value of <math>2\times 10^{-4}</math>, it is clear that the hypothesis human and machined models have same robustness to model distribution shifts ought to be rejected.<br />
<br />
== Other Observations ==<br />
<br />
[[File: Results_Summary_Table.png| 800px|center]]<br />
<br />
=== Difficult Images ===<br />
<br />
The experiment also shed some light on images that are difficult to label. 10 images were misclassified by all of the human labelers. Among those 10 images, there was 1 image of a monkey and 9 of dogs. In addition, 27 images, with 19 in object classes and 8 in organism classes, were misclassified by all 72 machine learning models in this experiment. Only 2 images were labeled wrong by all human labelers and models. Both images contained dogs. Researchers also noted that difficult images for models are mostly images of objects and exclusively images of animals for human labelers.<br />
<br />
=== Accuracies without dogs ===<br />
<br />
As previously discussed in the paper, machine learning models tend to outperform human labelers when classifying the 118 dog classes. To better understand to what extent does models outperform human labelers, researchers computed the accuracies again by excluding all the dog classes. Results showed a 0.6% increase in accuracy on the ImageNet images using the best model and a 1.1% increase on the ImageNet V2 images. In comparison, the mean increases in accuracy for human labelers are 1.9% and 1.8% on the ImageNet and ImageNet V2 images respectively. Researchers also conducted a simulation to demonstrate that the increase in human labeling accuracy on non-dog images is significant. This simulation was done by bootstrapping to estimate the changes in accuracy when only using data for the non-dog classes, and simulation results show smaller increases than in the experiment. <br />
<br />
In conclusion, it's more difficult for human labelers to classify images with dogs than it is for machine learning models.<br />
<br />
=== Accuracies on objects ===<br />
Researchers also computed machine and human labelers' accuracies on a subset of data with only objects, as opposed to organisms, to better illustrate the differences in performance. This test involved 590 object classes. As shown in the table above, there is a 3.3% and 3.4% increase in mean accuracies for human labelers on the ImageNet and ImageNet V2 images. In contrast, there is a 0.5% decrease in accuracy for the best model on both ImageNet and ImageNet V2. This indicates that human labelers are much better at classifying objects than these models are.<br />
<br />
=== Accuracies on fast images ===<br />
Unlike the CNN models, human labelers spent different amounts of time on different images, spanning from several seconds to 40 minutes. To further analyze the images that take human labelers less time to classify, researchers took a subset of images with median labeling time spent by human labelers of at most 60 seconds. These images were referred to as "fast images". There are 756 and 714 fast images from ImageNet and ImageNet V2 respectively, out of the total 2000 images used for evaluation. Accuracies of models and humans on the fast images increased significantly, especially for humans. <br />
<br />
This result suggests that human labelers know when an image is difficult to label and would spend more time on it. It also shows that the models are more likely to correctly label images that human labelers can label relatively quickly.<br />
<br />
== Related Work ==<br />
<br />
=== Human accuracy on ImageNet ===<br />
<br />
Russakovsky et al. (2015) studied two trained human labelers' accuracies on 1500 and 258 images in the context of the ImageNet challenge. The top-5 accuracy of the labeler who labeled 1500 images was the well-known human baseline on ImageNet. <br />
<br />
As introduced before, the researchers went beyond by using multi-label accuracy, using more labelers, and focusing on robustness to small distribution shifts. Although the researchers had some different findings, some results are also consistent with results from (Russakovsky et al., 2015). An example is that both experiments indicated that it takes human labelers around one minute to label an image. The time distribution also has a long tail, due to the difficult images as mentioned before.<br />
<br />
=== Human performance in computer vision broadly ===<br />
There are many examples of recent studies about humans in the area of computer vision, such as investigating human robustness to synthetic distribution change (Geirhos et al., 2017) and studying what characteristics do humans use to recognize objects (Geirhos et al., 2018). Other examples include the adversarial examples constructed to fool both machines and time-limited humans (Elsayed et al., 2018) and illustrating foreground/background objects' effects on human and machine performance (Zhu et al., 2016). <br />
<br />
=== Multi-label annotations ===<br />
Stock & Cissé (2017) also studied ImageNet's multi-label nature, which aligns with the researchers' study in this paper. According to Stock & Cissé (2017), the top-1 accuracy measure could underestimate multi-label by up to 13.2%. The authors suggest that releasing these labelled data to the public will allow for more robust models in the future.<br />
<br />
=== ImageNet inconsistencies and label error ===<br />
Researches have found and recorded some incorrectly labeled images from ImageNet and ImageNet V2 during this study. Earlier studies (Van Horn et al., 2015) also shown that at least 4% of the birds in ImageNet are misclassified. This work also noted that the inconsistent taxonomic structure in birds' classes could lead to weak class boundaries. Researchers also noted that the majority of the fine-grained organism classes also had similar taxonomic issues.<br />
<br />
=== Distribution shift ===<br />
There has been an increasing amount of studies in this area. One focus of the studies is distributionally robust optimization (DRO), which finds the model that has the smallest worst-case expected error over a set of probability distributions. Another focus is on finding the model with the lowest error rates on adversarial examples. Work in both areas has been productive, but none was shown to resolve the drop in accuracies between ImageNet and ImageNet V2. A recent [https://papers.nips.cc/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf paper] also discusses quantifying uncertainty under a distribution shift, in other words whether the output of probabilistic deep learning models should or should not be trusted.<br />
<br />
== Conclusion and Future Work ==<br />
<br />
=== Conclusion ===<br />
Researchers noted that in order to achieve truly reliable machine learning, they need a deeper understanding of the range of parameters where the model still remain robust. Techniques from Combinatorics and sensitivity analysis, in particular, might yield fruitful results. This study has provided valuable insights into the desired robustness properties by comparing model performance to human performance. This is especially evident given the results of the experiment which show humans drastically outperforming machine learning in many cases and proposes the question of how much accuracy one is willing to give up in exchange for efficiency. The results have shown that current performance benchmarks are not addressing the robustness to small and natural distribution shifts, which are easily handled by humans.<br />
<br />
=== Future work ===<br />
Other than improving the robustness of models, researchers should consider investigating if less-trained human labelers can achieve a similar level of robustness to distributional shifts. In addition, researchers can study the robustness to temporal changes, which is another form of natural distribution shift (Gu et al., 2019; Shankar et al., 2019). Also, Convolutional Neural Network can be a candidate to improve the accuracy of classifying images.<br />
<br />
== Critiques ==<br />
# The method of using human to classify Imagenet is fully circular, since the label of imagenet itself is originally annotated by human beings. In fact, the classification scheme itself is intrinsically human construction. It is not logical to test human performance with human performance. This circular contsruction completely violates scientific principles.<br />
# Table 1 simply showed a difference in ImageNet multi-label accuracy yet does not give an explicit reason as to why such a difference is present. Although the paper suggested the distribution shift has caused the difference, it does not give other factors to concretely explain why the distribution shift was the cause.<br />
# With the recommendation to future machine evaluations, the paper proposed to "Report performances on dogs, other animals, and inanimate objects separately.". Despite its intentions, it is narrowly specific and requires further generalization for it to be convincing. <br />
# With choosing human subjects as samplers, no further information was given as to how they are chosen nor there are any background information was given. As it is a classification problem involving many classes as specific to species, a biology student would give far more accurate results than a computer science student or a math student. <br />
# As explaining the importance of multi-label metrics using comparison to Top-5 metric, the turtle example falls within the overall similarity (simony) classification of the multi-label evaluation metric, as such, if the Top-5 evaluation suggests any one of the turtle species were selected, the algorithm is considered to produce a correct prediction which is the intention. The example does not convey the necessity of changing to the proposed metric over the Top-5 metric. <br />
# With the definition in the paper regarding multi-label metrics, it is hard to see why expanding the label set is different from a traditional Top-5 metric or rather necessary, ergo does not yield the claim which the proposed metric is necessary for rigorous accuracy evaluation on ImageNet.<br />
# When discussing the main results, the paper discusses the hypothesis on distribution shift having no effects on human and machine model accuracies; the presentation is poor at best with no clear centric to what they are trying to convey to how (in detail) they resulted in such claims.<br />
# In the experiment setup of the presentation, there are a lot of key terms without detailed description. For example, Human labeler training using a subset of the remaining 30,000 unannotated images in the ImageNet validation set, labelers A, B, C, D, and E underwent extensive training to understand the intricacies of fine-grained class distinctions in the ImageNet class hierarchy. Authors should clarify each key term in the presentation otherwise readers are hard to follow.<br />
# Not sure how the human samplers were determined and simply picking several people will have really high bias because the sample is too small and they have different background which will definitely affect the results a lot. Also, it will be better if there are more comparisons between the model introduced and other models.<br />
# Given the low amount of human participants, it is hard to take the results seriously (there is too much variance). Also it's not exactly clear how the authors determined that the multi-label accuracy metric measures a semantically more meaningful notion of accuracy compared to its counterparts. For example, one of the issues with top-5 accuracy that they mention is: "For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations." But it's not clear how multi-label accuracy would be better in this instance.<br />
# It is unclear how well the human labeler can perform labeling after training. So the final result is not that trust-worthy.<br />
# In this experiment set up, label annotators are the same as participants of the experiments. Even if there's a break between the annotating and evaluating human labeler evaluation, the impact of the break in reducing bias is not clear. One potential human labeling data is google's "I'm not a robot" verification test. One variation of the verification test asks users to select all the photos from 9 images that are related to a certain keyword. This allows for a more accurate measurement of human performance vs ImageNet performance. In addition, it's going to reduce the biases from the small number of experiment participants.<br />
# Following Table 2, the authors appear to try and claim that the model is better than the human labelers, simply because the model experienced a better increase in classification following the removal of dog photos then the human labeler did, however, a quick look at the table shows that most human labelers still performed better than the best model. The authors should be making the claim that human labelers are better at labeling dogs than the modal, but are still better overall after removing the dogs dataset.<br />
# The reason why human labeler outperforms CNN could be human had much more training. It would be more convincing if the paper could provide a metric in order to measure human labelers' training data set size.<br />
# Actually, in the multi-label case, it is vague to determine whether the machine learning model or the human labellers were giving the correct label. The structure of the dataset is pretty essential in training a network, in which data with uncertain label (even determined by human) should be avoided.<br />
# The authors mentioned that untrained labelers will likely be in lower accuracy, they can give a standard or definition about a well-trained labeler.<br />
# I believe the authors needed to include more information about how they determined the samples such as human samplers, and also more details on how to define unclear images.<br />
# It would be more convincing if the author could provide the criteria of being human samplers and unclear images, and the accuracy of the human labeler.<br />
# The summary only explains some model components but does not thoroughly goes through the big picture of the model; data-preprocessing, training, and prediction procedures. It would be nice to know the details as well.<br />
# It seems the core problem is more about the dataset itself and not the evaluation procedure. We would not have issues with top 1 and top 5 if Imagenet contained discernable classes with good labels. Of course, this is very expensive, and imagenet is an _excellent_ dataset given these constraints. It does not seem like their proposed solution, multiple labels per image, addresses their concerns properly, as other critiques have already mentioned. Furthermore, having multiple labels per image does not translate to real-life value the same way that the top 5 or top 1 metric does, as in the common case, there is one right answer for a classification problem.<br />
# The paper could provide details on ways to improve the accuracy and robustness of the model. Since the paper mentions CNN, it could provide details of the model and why CNN is a good candidate.<br />
# The accuracy of the model is directly correlated with how the images are labelled. In all multi-label annotations, the authors describe a predicted label as correct if it is within a set of "correct labels" where each image has a different number of correct labels. Perhaps it would yield better results if the model were to first identify the number of objects in the image first and then by using some form of criteria, it labels those identified objects in order of importance (i.e. objects that are closer are labelled first). The authors also never specified what criteria the model uses to "pick out" which object it will label in the image.<br />
# The paper mentions difficult images and fast images. It would be better if the paper had generalized the type of images that constitute difficult images (i.e., the paper mentions 118 dog classes, what are some general characteristics of difficult images?) In addition, it would be interesting to compare the performance between human and machine accuracy on non-fast images.<br />
# The paper meaingfully and correctly points out the problem that the current evaluation of ML algorithm by only using the accuracy on Imagnet as the bechmark is simplistic and probelmatic. However, the idea of comparing human performance with ML models is problematic, since it is hard or even impossible to control the various variables that can drastically change human performance: training time, domain knowledge, cognitive function, amount of work-load, and various environmental factors. In order to compare different experimental methods, the most important step is to carefully control the confounding variables to reach any meaningful conclusion.<br />
# The original question would be how ImageNet was created? The images can only be labeled by human labelers. There might be some mistakes and missing details. The combination of human evaluation and machine evaluation might be helpful to resolve this problem.<br />
# Regarding the difficult problems, CatsandDogs is a very popular dataset that in this day it is trivial for a non-deep CNN to classify with high accuracies. The author would do well to test this CNN with other classes, since its inconceivable that a CNN cannot learn the characteristics of a dog with relation to the other classes in Imagenet. Why would it be able to differentiate the breeds of dogs, but not other organisms?<br />
<br />
== Reference ==<br />
[1] Shankar, V., Roelofs, R., Mania, H., Fang, A., Recht, B., & Schmidt, L. (2020). Evaluating Machine Accuracy on ImageNet. ICML 2020.<br />
<br />
[2] Krizhevsky, A., Sutskever, I., & Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. 2. Retrieved from http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Evaluating_Machine_Accuracy_on_ImageNet&diff=49699Evaluating Machine Accuracy on ImageNet2020-12-07T03:36:45Z<p>X277lu: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du<br />
<br />
== Introduction == <br />
ImageNet is one of the most influential dataset in machine learning with images and corresponding labels over 1000 classes. This paper intends to explore the causes for performance differences between human experts and machine learning models, more specifically, CNN, on ImageNet. <br />
<br />
Firstly, some images could belong to multiple classes. As a result, it is possible to underestimate the performance if we assign each image with only one label, which is what is being done in the top-1 metric. On the other hand, the top-5 metric looks at the top five predictions by the model for an image and checks if the target label is within those five predictions (Krizhevsky, Sutskever, & Hinton). Therefore, we adopt both top-1 and top-5 metrics where the performances of models, unlike human labelers, are linearly correlated in both cases.<br />
<br />
Secondly, in contrast to the uniform performance of models in classes, humans tend to achieve better performances on inanimate objects. Human labelers achieve similar overall accuracies as the models, which indicates spaces of improvements on specific classes for machines.<br />
<br />
Lastly, the setup of drawing training and test sets from the same distribution may favor models over human labelers. That is, the accuracy of multi-class prediction from models drops when the testing set is drawn from a different distribution than the training set, ImageNetV2. But this shift in distribution does not cause a problem for human labelers.<br />
<br />
== Experiment Setup ==<br />
=== Overview ===<br />
There are four main phases to the experiment, which are (i) initial multilabel annotation, (ii) human labeler training, (iii) human labeler evaluation, and (iv) final annotation overview. The five authors of the paper are the participants in the experiments. <br />
<br />
A brief overview of the four phases is as follows:<br />
[[File:Experiment Set Up.png |800px| center]]<br />
<br />
=== Initial multi-label annotation ===<br />
Three labelers A, B, and C provided multi-label annotations for a subset from the ImageNet validation set, and all images from the ImageNetV2 test sets. These experiences give A, B, and C extensive experience with the ImageNet dataset. <br />
<br />
=== Human Labeler Training === <br />
All five labelers trained on labeling a subset of the remaining ImageNet images. "Training" the human labelers consisted of teaching the humans the distinctions between very similar classes in the training set. For example, there are 118 classes of "dog" within ImageNet and typical human participants will not have working knowledge of the names of each breed of dog seen even if they can recognize and distinguish that breed from others. Local members of the American Kennel Club were even contacted to help with dog breed classification. To do this labelers were trained on class-specific tasks for groups like dogs, insects, monkeys beaver and others. They were also given immediate feedback on whether they were correct and then were asked where they thought they needed more training to improve. Unlike the two annotators in (Russakovsky et al., 2015), who had insufficient training data, the labelers in this experiment had up to 100 training images per class while labeling. This allowed the labelers to really understand the finer details of each class.<br />
<br />
=== Human Labeler Evaluation ===<br />
Class-balanced random samples, which contain 1,000 images from the 20,000 annotated images are generated from both the ImageNet validation set and ImageNetV2. Five participants labeled these images over 28 days.<br />
<br />
=== Final annotation Review ===<br />
All labelers reviewed the additional annotations generated in the human labeler evaluation phase.<br />
<br />
== Multi-label annotations==<br />
[[File:Categories Multilabel.png|800px|center]]<br />
<div align="center">Figure 3</div><br />
<br />
===Top-1 accuracy===<br />
With Top-1 accuracy being the standard accuracy measure used in classification studies, it measures the proportions of examples for which the predicted label matches the single target label. As many images often contain more than one object for classification, for example, Figure 3a contains a desk, laptop, keyboard, space bar, and more. With Figure 3b showing a centered prominent figure yet labeled otherwise (people vs picket fence), it can be seen how a single target label is inaccurate for such a task since identifying the main objects in the image does not suffice due to its overly stringent and punishes predictions that are the main image yet does not match its label.<br />
===Top-5 accuracy===<br />
With Top-5 considers a classification correct if the object label is in the top 5 predicted labels. Although it partially resolves the problem with Top-1 labeling, it is still not ideal since it can trivialize class distinctions. For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations.<br />
<br />
===Multi-label accuracy===<br />
The paper then proposes that for every image, the image shall have a set of target labels and a prediction; if such prediction matches one of the labels, it will be considered as correct labeling. Due to the above-discussed limitations of Top-1 and Top-5 metrics, the paper claims it is necessary for rigorous accuracy evaluation on the dataset. <br />
<br />
===Types of Multi-label annotations===<br />
====Multiple objects or organisms====<br />
For the images containing more than one object or organism that corresponds to ImageNet, the paper proposed to add an additional target label for each entity in the image. With the discussed image in Figure 3b, the class groom, bow tie, suit, gown, and hoopskirt are all present in the foreground which is then subsequently added to the set of labels.<br />
====Synonym or subset relations====<br />
For similar classes, the paper considers them as under the same bigger class, that is, for two similarly labeled images, classification is considered correct if the produced label matches either one of the labels. For instance, warthog, African elephant, and Indian element all have prominent tusks, they will be considered subclasses of the tusker, Figure 3c shows a modification of labels to contain tusker as a correct label.<br />
====Unclear Image====<br />
In certain cases such as Figure 3d, there is a distinctive difficulty to determine whether a label was correct due to ambiguities in the class hierarchy.<br />
===Collecting multi-label annotations===<br />
Participants reviewed all predictions made by the models on the dataset ImageNet and ImageNet-V2, the participants then categorized every unique prediction made by the models on the dataset into correct and incorrect labels in order to allow all images to have multiple correct labels to satisfy the above-listed method.<br />
===The multi-label accuracy metric===<br />
One prediction is only correct if and only if it was marked correct by the expert reviewers during the annotation stage. As discussed in the experiment setup section, after human labelers have completed labeling, a second annotation stage is conducted. In Figure 4, a comparison of Top-1, Top-5, and multi-label accuracies showed higher Top-1 and Top-5 accuracy corresponds with higher multi-label accuracy as expected. With multi-label accuracies measures consistently higher than Top-1 yet lower than Top-5 which shows a high correlation between the three metrics, the paper concludes that multi-label metrics measures a semantically more meaningful notion of accuracy compared to its counterparts.<br />
<br />
== Human Accuracy Measurement Process ==<br />
=== Bias Control ===<br />
Since three participants participated in the initial round of annotation, they did not look at the data for six months, and two additional annotators are introduced in the final evaluation phase to ensure fairness of the experiment. <br />
<br />
=== Human Labeler Training ===<br />
The three main difficulties encountered during human labeler training are fine-grained distinctions, class unawareness, and insufficient training images. Thus, three training regimens are provided to address the problems listed above, respectively. First, labelers will be assigned extra training tasks with immediate feedbacks on similar classes. Second, labelers will be provided access to search for specific classes during labeling. Finally, the training set will contain a reasonable amount of images for each class.<br />
<br />
=== Labeling Guide ===<br />
A labeling guide is constructed to distill class analysis learned during training into discriminative traits that could be used as a reference during the final labeling evaluation.<br />
<br />
=== Final Evaluation and Review ===<br />
Two samples, each containing 1000 images, are sampled from ImageNet and ImageNetV2, respectively, They are sampled in a class-balanced manner and shuffled together. Over 28 days, all five participants labeled all images. They spent a median of 26 seconds per image. After labeling is completed, an additional multi-label annotation session was conducted, in which human predictions for all images are manually reviewed. Comparing to the initial round of labeling, 37% of the labels changes due to participants' greater familiarity with the classes.<br />
<br />
== Main Results ==<br />
[[File:Evaluating Machine Accuracy on ImageNet Figure 1.png | center]]<br />
<br />
<div align="center">Figure 1</div><br />
<br />
===Comparison of Human and Machine Accuracies on Image Net===<br />
From Figure 1, we can see that the difference in accuracies between the datasets is within 1% for all human participants. As hypothesized, human testers indeed performed better than the automated models on both datasets. It's worth noticing that labelers D and E, who did not participate in the initial annotation period, actually performed better than the best automated model.<br />
===Comparison of Human and Machine Accuracies on Image Net===<br />
Based on the results shown in Figure 1, we can see that the confidence interval of the best 4 human participants and 4 best model overlap; however, with a p-value of 0.037 using the McNemar's paired test, it rejects the hypothesis that the FixResNeXt model and Human E labeler have the same accuracy with respect to the ImageNet validation dataset. Figure 1 also shows that the confidence intervals of the labeling accuracies for human labelers C, D, E do not overlap with the confidence interval of the best model with respect to ImageNet-V2 and with the McNemar's test yielding a p-value of <math>2\times 10^{-4}</math>, it is clear that the hypothesis human and machined models have same robustness to model distribution shifts ought to be rejected.<br />
<br />
== Other Observations ==<br />
<br />
[[File: Results_Summary_Table.png| 800px|center]]<br />
<br />
=== Difficult Images ===<br />
<br />
The experiment also shed some light on images that are difficult to label. 10 images were misclassified by all of the human labelers. Among those 10 images, there was 1 image of a monkey and 9 of dogs. In addition, 27 images, with 19 in object classes and 8 in organism classes, were misclassified by all 72 machine learning models in this experiment. Only 2 images were labeled wrong by all human labelers and models. Both images contained dogs. Researchers also noted that difficult images for models are mostly images of objects and exclusively images of animals for human labelers.<br />
<br />
=== Accuracies without dogs ===<br />
<br />
As previously discussed in the paper, machine learning models tend to outperform human labelers when classifying the 118 dog classes. To better understand to what extent does models outperform human labelers, researchers computed the accuracies again by excluding all the dog classes. Results showed a 0.6% increase in accuracy on the ImageNet images using the best model and a 1.1% increase on the ImageNet V2 images. In comparison, the mean increases in accuracy for human labelers are 1.9% and 1.8% on the ImageNet and ImageNet V2 images respectively. Researchers also conducted a simulation to demonstrate that the increase in human labeling accuracy on non-dog images is significant. This simulation was done by bootstrapping to estimate the changes in accuracy when only using data for the non-dog classes, and simulation results show smaller increases than in the experiment. <br />
<br />
In conclusion, it's more difficult for human labelers to classify images with dogs than it is for machine learning models.<br />
<br />
=== Accuracies on objects ===<br />
Researchers also computed machine and human labelers' accuracies on a subset of data with only objects, as opposed to organisms, to better illustrate the differences in performance. This test involved 590 object classes. As shown in the table above, there is a 3.3% and 3.4% increase in mean accuracies for human labelers on the ImageNet and ImageNet V2 images. In contrast, there is a 0.5% decrease in accuracy for the best model on both ImageNet and ImageNet V2. This indicates that human labelers are much better at classifying objects than these models are.<br />
<br />
=== Accuracies on fast images ===<br />
Unlike the CNN models, human labelers spent different amounts of time on different images, spanning from several seconds to 40 minutes. To further analyze the images that take human labelers less time to classify, researchers took a subset of images with median labeling time spent by human labelers of at most 60 seconds. These images were referred to as "fast images". There are 756 and 714 fast images from ImageNet and ImageNet V2 respectively, out of the total 2000 images used for evaluation. Accuracies of models and humans on the fast images increased significantly, especially for humans. <br />
<br />
This result suggests that human labelers know when an image is difficult to label and would spend more time on it. It also shows that the models are more likely to correctly label images that human labelers can label relatively quickly.<br />
<br />
== Related Work ==<br />
<br />
=== Human accuracy on ImageNet ===<br />
<br />
Russakovsky et al. (2015) studied two trained human labelers' accuracies on 1500 and 258 images in the context of the ImageNet challenge. The top-5 accuracy of the labeler who labeled 1500 images was the well-known human baseline on ImageNet. <br />
<br />
As introduced before, the researchers went beyond by using multi-label accuracy, using more labelers, and focusing on robustness to small distribution shifts. Although the researchers had some different findings, some results are also consistent with results from (Russakovsky et al., 2015). An example is that both experiments indicated that it takes human labelers around one minute to label an image. The time distribution also has a long tail, due to the difficult images as mentioned before.<br />
<br />
=== Human performance in computer vision broadly ===<br />
There are many examples of recent studies about humans in the area of computer vision, such as investigating human robustness to synthetic distribution change (Geirhos et al., 2017) and studying what characteristics do humans use to recognize objects (Geirhos et al., 2018). Other examples include the adversarial examples constructed to fool both machines and time-limited humans (Elsayed et al., 2018) and illustrating foreground/background objects' effects on human and machine performance (Zhu et al., 2016). <br />
<br />
=== Multi-label annotations ===<br />
Stock & Cissé (2017) also studied ImageNet's multi-label nature, which aligns with the researchers' study in this paper. According to Stock & Cissé (2017), the top-1 accuracy measure could underestimate multi-label by up to 13.2%. The authors suggest that releasing these labelled data to the public will allow for more robust models in the future.<br />
<br />
=== ImageNet inconsistencies and label error ===<br />
Researches have found and recorded some incorrectly labeled images from ImageNet and ImageNet V2 during this study. Earlier studies (Van Horn et al., 2015) also shown that at least 4% of the birds in ImageNet are misclassified. This work also noted that the inconsistent taxonomic structure in birds' classes could lead to weak class boundaries. Researchers also noted that the majority of the fine-grained organism classes also had similar taxonomic issues.<br />
<br />
=== Distribution shift ===<br />
There has been an increasing amount of studies in this area. One focus of the studies is distributionally robust optimization (DRO), which finds the model that has the smallest worst-case expected error over a set of probability distributions. Another focus is on finding the model with the lowest error rates on adversarial examples. Work in both areas has been productive, but none was shown to resolve the drop in accuracies between ImageNet and ImageNet V2. A recent [https://papers.nips.cc/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf paper] also discusses quantifying uncertainty under a distribution shift, in other words whether the output of probabilistic deep learning models should or should not be trusted.<br />
<br />
== Conclusion and Future Work ==<br />
<br />
=== Conclusion ===<br />
Researchers noted that in order to achieve truly reliable machine learning, researchers need a deeper understanding of the range of parameters where the model still remain robust. Techniques from Combinatorics and sensitivity analysis, in particular, might yield fruitful results. This study has provided valuable insights into the desired robustness properties by comparing model performance to human performance. This is especially evident given the results of the experiment which show humans drastically outperforming machine learning in many cases and proposes the question of how much accuracy one is willing to give up in exchange for efficiency. The results have shown that current performance benchmarks are not addressing the robustness to small and natural distribution shifts, which are easily handled by humans.<br />
<br />
=== Future work ===<br />
Other than improving the robustness of models, researchers should consider investigating if less-trained human labelers can achieve a similar level of robustness to distributional shifts. In addition, researchers can study the robustness to temporal changes, which is another form of natural distribution shift (Gu et al., 2019; Shankar et al., 2019). Also, Convolutional Neural Network can be a candidate to improve the accuracy of classifying images.<br />
<br />
== Critiques ==<br />
# The method of using human to classify Imagenet is fully circular, since the label of imagenet itself is originally annotated by human beings. In fact, the classification scheme itself is intrinsically human construction. It is not logical to test human performance with human performance. This circular contsruction completely violates scientific principles.<br />
# Table 1 simply showed a difference in ImageNet multi-label accuracy yet does not give an explicit reason as to why such a difference is present. Although the paper suggested the distribution shift has caused the difference, it does not give other factors to concretely explain why the distribution shift was the cause.<br />
# With the recommendation to future machine evaluations, the paper proposed to "Report performances on dogs, other animals, and inanimate objects separately.". Despite its intentions, it is narrowly specific and requires further generalization for it to be convincing. <br />
# With choosing human subjects as samplers, no further information was given as to how they are chosen nor there are any background information was given. As it is a classification problem involving many classes as specific to species, a biology student would give far more accurate results than a computer science student or a math student. <br />
# As explaining the importance of multi-label metrics using comparison to Top-5 metric, the turtle example falls within the overall similarity (simony) classification of the multi-label evaluation metric, as such, if the Top-5 evaluation suggests any one of the turtle species were selected, the algorithm is considered to produce a correct prediction which is the intention. The example does not convey the necessity of changing to the proposed metric over the Top-5 metric. <br />
# With the definition in the paper regarding multi-label metrics, it is hard to see why expanding the label set is different from a traditional Top-5 metric or rather necessary, ergo does not yield the claim which the proposed metric is necessary for rigorous accuracy evaluation on ImageNet.<br />
# When discussing the main results, the paper discusses the hypothesis on distribution shift having no effects on human and machine model accuracies; the presentation is poor at best with no clear centric to what they are trying to convey to how (in detail) they resulted in such claims.<br />
# In the experiment setup of the presentation, there are a lot of key terms without detailed description. For example, Human labeler training using a subset of the remaining 30,000 unannotated images in the ImageNet validation set, labelers A, B, C, D, and E underwent extensive training to understand the intricacies of fine-grained class distinctions in the ImageNet class hierarchy. Authors should clarify each key term in the presentation otherwise readers are hard to follow.<br />
# Not sure how the human samplers were determined and simply picking several people will have really high bias because the sample is too small and they have different background which will definitely affect the results a lot. Also, it will be better if there are more comparisons between the model introduced and other models.<br />
# Given the low amount of human participants, it is hard to take the results seriously (there is too much variance). Also it's not exactly clear how the authors determined that the multi-label accuracy metric measures a semantically more meaningful notion of accuracy compared to its counterparts. For example, one of the issues with top-5 accuracy that they mention is: "For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations." But it's not clear how multi-label accuracy would be better in this instance.<br />
# It is unclear how well the human labeler can perform labeling after training. So the final result is not that trust-worthy.<br />
# In this experiment set up, label annotators are the same as participants of the experiments. Even if there's a break between the annotating and evaluating human labeler evaluation, the impact of the break in reducing bias is not clear. One potential human labeling data is google's "I'm not a robot" verification test. One variation of the verification test asks users to select all the photos from 9 images that are related to a certain keyword. This allows for a more accurate measurement of human performance vs ImageNet performance. In addition, it's going to reduce the biases from the small number of experiment participants.<br />
# Following Table 2, the authors appear to try and claim that the model is better than the human labelers, simply because the model experienced a better increase in classification following the removal of dog photos then the human labeler did, however, a quick look at the table shows that most human labelers still performed better than the best model. The authors should be making the claim that human labelers are better at labeling dogs than the modal, but are still better overall after removing the dogs dataset.<br />
# The reason why human labeler outperforms CNN could be human had much more training. It would be more convincing if the paper could provide a metric in order to measure human labelers' training data set size.<br />
# Actually, in the multi-label case, it is vague to determine whether the machine learning model or the human labellers were giving the correct label. The structure of the dataset is pretty essential in training a network, in which data with uncertain label (even determined by human) should be avoided.<br />
# The authors mentioned that untrained labelers will likely be in lower accuracy, they can give a standard or definition about a well-trained labeler.<br />
# I believe the authors needed to include more information about how they determined the samples such as human samplers, and also more details on how to define unclear images.<br />
# It would be more convincing if the author could provide the criteria of being human samplers and unclear images, and the accuracy of the human labeler.<br />
# The summary only explains some model components but does not thoroughly goes through the big picture of the model; data-preprocessing, training, and prediction procedures. It would be nice to know the details as well.<br />
# It seems the core problem is more about the dataset itself and not the evaluation procedure. We would not have issues with top 1 and top 5 if Imagenet contained discernable classes with good labels. Of course, this is very expensive, and imagenet is an _excellent_ dataset given these constraints. It does not seem like their proposed solution, multiple labels per image, addresses their concerns properly, as other critiques have already mentioned. Furthermore, having multiple labels per image does not translate to real-life value the same way that the top 5 or top 1 metric does, as in the common case, there is one right answer for a classification problem.<br />
# The paper could provide details on ways to improve the accuracy and robustness of the model. Since the paper mentions CNN, it could provide details of the model and why CNN is a good candidate.<br />
# The accuracy of the model is directly correlated with how the images are labelled. In all multi-label annotations, the authors describe a predicted label as correct if it is within a set of "correct labels" where each image has a different number of correct labels. Perhaps it would yield better results if the model were to first identify the number of objects in the image first and then by using some form of criteria, it labels those identified objects in order of importance (i.e. objects that are closer are labelled first). The authors also never specified what criteria the model uses to "pick out" which object it will label in the image.<br />
# The paper mentions difficult images and fast images. It would be better if the paper had generalized the type of images that constitute difficult images (i.e., the paper mentions 118 dog classes, what are some general characteristics of difficult images?) In addition, it would be interesting to compare the performance between human and machine accuracy on non-fast images.<br />
# The paper meaingfully and correctly points out the problem that the current evaluation of ML algorithm by only using the accuracy on Imagnet as the bechmark is simplistic and probelmatic. However, the idea of comparing human performance with ML models is problematic, since it is hard or even impossible to control the various variables that can drastically change human performance: training time, domain knowledge, cognitive function, amount of work-load, and various environmental factors. In order to compare different experimental methods, the most important step is to carefully control the confounding variables to reach any meaningful conclusion.<br />
# The original question would be how ImageNet was created? The images can only be labeled by human labelers. There might be some mistakes and missing details. The combination of human evaluation and machine evaluation might be helpful to resolve this problem.<br />
# Regarding the difficult problems, CatsandDogs is a very popular dataset that in this day it is trivial for a non-deep CNN to classify with high accuracies. The author would do well to test this CNN with other classes, since its inconceivable that a CNN cannot learn the characteristics of a dog with relation to the other classes in Imagenet. Why would it be able to differentiate the breeds of dogs, but not other organisms?<br />
<br />
== Reference ==<br />
[1] Shankar, V., Roelofs, R., Mania, H., Fang, A., Recht, B., & Schmidt, L. (2020). Evaluating Machine Accuracy on ImageNet. ICML 2020.<br />
<br />
[2] Krizhevsky, A., Sutskever, I., & Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. 2. Retrieved from http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN&diff=49698Mask RCNN2020-12-07T03:35:26Z<p>X277lu: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN (Region Based Convolutional Neural Networks)[1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision which is important when attempting to identify different objects within the same image.It combines elements from classical computer vision of object detection and semantic segmentation. RCNN base architectures first extract a regional proposal (a region of the image where the object of interest is proposed to lie) and then attempts to classify the object within it. <br />
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. This is done by using a Fully Convolutional Network as each mask branch in a pixel-by-pixel way. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection.<br />
<br />
== Visual Perception Tasks == <br />
<br />
Figure 1 shows a visual representation of different types of visual perception tasks:<br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Localizing objects in an image by building bounding boxes around the object, then proceed to classify them according to a set of classes. The current baseline system for object detection is Fast/Faster R-CNN.<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label. The common baseline system for semantic segmentation is FCN (Fully Convolutional Network).<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object. Instance segmentation combines image classification, object detection and semantic segmentation making it a complex task.<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture combining multiple state-of-art techniques for the task of Instance Segmentation.<br />
<br />
== Related Architecture to Mask RCNN == <br />
Region Proposal Network: A Region Proposal Network (RPN) proposes candidate object bounding boxes, which is the first step for effective object detection. It takes an image (of any size) as input and outputs a set of rectangular object boxes, each with an objectness score.<br />
<br />
ROI Pooling: The main use of ROI (Region of Interest) Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section. ROI pooling solves the problem of fixed image size requirement for object detection network. The entire image feeds a CNN model to detect RoI on the feature maps.<br />
<br />
[[File:ROI pooling.png | center | 300px]]<br />
<br />
<br />
Faster R-CNN: Faster R-CNN consists of two stages: Region Proposal Network and Fast R-CNN using ROI Pooling. Region Proposal Network proposes candidate object bounding boxes. ROI Pooling, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale. Other than FPN, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of mask R-CNN is quite similar to the structure of faster R-CNN. <br />
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying the stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI as <math>L=L_{cls}+L_{box}+L_{mask}</math>, where <math>L_{cls}</math>, <math>L_{box}</math>, <math>L_{mask}</math> represent the classification loss, bounding box loss and the average binary cross-entropy loss respectively.<br />
<br />
The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, complete R-CNN pipeline stages involve 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stages 3 and 4 are adjusted to simplify the network procedures.<br />
<br />
The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.<br />
One thing worth noticing is that for other network systems, those masks across classes compete with each other. However, in this particular case with a <br />
per-pixel sigmoid and a binary loss the masks across classes no longer compete, it makes this formula the key for good instance segmentation results.<br />
<br />
=== RoIAlign ===<br />
<br />
This concept is useful in stage 2 where the RoIPool extracts features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to the pixel-to-pixel correspondence. <br />
<br />
The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. <br />
<br />
RoIPool is standard for extracting a small feature map from each RoI. However, it performs quantization before subdividing into spatial bins which are further quantized. Quantization produces misalignments when it comes to predicting pixel accurate masks. Therefore, instead of quantization, the coordinates are computed using bilinear interpolation They use bilinear interpolation to get the exact values of the inputs features at the 4 RoI bins and aggregate the result (using max or average). These results are robust to the sampling location and number of points and to guarantee spatial correspondence.<br />
<br />
The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
Some implementation details should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of the art model <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Human Pose Estimation ==<br />
Mask RCNN can be extended to human pose estimation.<br />
<br />
The simple approach the paper presents is to model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types such as left shoulder, right elbow. The model has minimal knowledge of human pose and this example illustrates the generality of the model.<br />
<br />
[[File:HumanPose.png | center]]<br />
<div align="center">Figure 11: Keypoint Detection Results</div><br />
<br />
== Experiments on Cityscapes ==<br />
The model was also tested on Cityscapes dataset. From this dataset the authors used 2975 annotated images for training, 500 for validation, and 1525 for testing. The instance segmentation task involved eight categories: person, rider, car, truck, bus, train, motorcycle and bicycle. When the Mask R-CNN model was applied to the data it achieved 26.2 AP on the testing data which was an over 30% improvement on the previous best entry. <br />
<br />
<center><br />
[[ File:cityscapeDataset.png ]]<br />
<br />
<br />
Figure 12. Cityscapes Results<br />
</center><br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks the comparisons of different methods and Mask RNN on unlabeled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.<br />
<br />
The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.<br />
<br />
The Mask RCNN could be a candidate model to do short-term predictions on the physical behaviors of a person, which could be very useful at crime scenes.<br />
<br />
For the most part, instance segmentation is now quite achievable, and it’s time to start thinking about innovative ways of using this idea of doing computer vision algorithms at a pixel by pixel level such as the DensePose algorithm. <br />
<br />
An interesting application of Mask RCNN would be on face recognition from CCTVs. Flurry pictures of crowded people could be obtained from CCTV, so that mask RCNN can be applied to distinguish each person.<br />
<br />
The main problem for CNN architectures like Mask RCNN is the running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live stream detections, where a faster running time would mean a better response to changes in frames. It would be beneficial to have a graphical representation of the Mask RCNN running times against single shot detector algorithms such as YOLOv3.<br />
<br />
It is interesting to investigate a solution of embedding instance segmentation with semantic segmentation to improve time performance. Because in many situations, knowing the exact boundary of an object is not necessary.<br />
<br />
<br />
It will be better if we can have more comparisons with other models. It will also be nice if we can have more details about why Mask RCNN can perform better, and how about the efficiency of it?<br />
The authors mentioned that Mask R-CNN is a deep neural network architecture for Instance Segmentation. It's better to include more background information about this task. For example, challenges of this task (e.g. the model will need to take into account the overlapping of objects) and limitations of existing methods.<br />
<br />
It would be interesting to see how a postprocessing step with conditional random fields (CRF) might improve (or not?) segmentation. It would also have been interesting to see the performance of the method with lighter backbones since the backbones used to have a very large inference time which makes them unsuitable for many applications.<br />
<br />
An extension of the application of Mask RCNN in medical AI is to highlight areas of an MRI scan that correlate to certain behavioral/psychological patterns.<br />
<br />
The use of these in medical imaging systems seems rather useful, but it can also be extended to more general CCTV camera systems which can also detect physiological patterns.<br />
<br />
In the Human Pose Estimation section, we assume that Mask RCNN does not have any knowledge of human poses, and all the predictions are based on keypoints on human bodies, for example, left shoulder and right elbow. While in fact we may be able to achieve better performances here because currently this approach is strongly dependent on correct classifications of human body parts. That is, if the model messed up the position of left shoulder, the position estimation will be awful. It is important to remove the dependency on preceding predictions, so that even when previous steps fail, we may still expect a fair performance.<br />
<br />
It will be interesting to see if applying dropout can boost this Mask RCNN architecture's performance.<br />
<br />
It will be interesting if mask RCNN is applied to human faces and how it classify each individual also would be nice to see how the technical calculations such as classification and predictions are done.<br />
<br />
It would be interesting to know how the RCNN model will perform on unbalanced data and how the performance compares with other models in this circumstances.<br />
<br />
The authors omitted the details of the training and the computational cost of training the model. Since RCNN combines stages 3 and 4 (SVM to categorize and bounding box regression), how does this affect the computational cost of the model? Similar architectures to the RCNN have long training times so it is of interest to know the computational runtime of this model in comparison to other models.<br />
<br />
It's amazing what these researchers were able to achieve with adding minimal overhead, and how well it generalizes using two completely different datasets. For the future work it would be nice to see if the model is able to also predict the distance between the objects that overlap in an image, without adding any further significant overhead. <br />
<br />
Additionally, it would be nice to see how well the model is able to predict collision detection between the objects given that it is currently at 5 frames-per-second (which is still really impressive, it just would be interesting to see how much would be possible)<br />
<br />
It would be more helpful if metrics of measuring classification errors are introduced briefly. Otherwise, it can be confusing.<br />
<br />
== Interesting Directions ==<br />
<br />
There is recent work on ResNeSt: Split-Attention Networks (https://arxiv.org/abs/2004.08955), which uses an explicit soft attention mechanism over channels within a ResNeXt style architecture which shows improvements to classification. It would be interesting to use this backbone with Mask R-CNN and see if the attention helps capture longer range dependencies and thus produce better segmentations.<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN&diff=49697Mask RCNN2020-12-07T03:33:15Z<p>X277lu: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN (Region Based Convolutional Neural Networks)[1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision which is important when attempting to identify different objects within the same image.It combines elements from classical computer vision of object detection and semantic segmentation. RCNN base architectures first extract a regional proposal (a region of the image where the object of interest is proposed to lie) and then attempts to classify the object within it. <br />
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. This is done by using a Fully Convolutional Network as each mask branch in a pixel-by-pixel way. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection.<br />
<br />
== Visual Perception Tasks == <br />
<br />
Figure 1 shows a visual representation of different types of visual perception tasks:<br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Localizing objects in an image by building bounding boxes around the object, then proceed to classify them according to a set of classes. The current baseline system for object detection is Fast/Faster R-CNN.<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label. The common baseline system for semantic segmentation is FCN (Fully Convolutional Network).<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object. Instance segmentation combines image classification, object detection and semantic segmentation making it a complex task.<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture combining multiple state-of-art techniques for the task of Instance Segmentation.<br />
<br />
== Related Architecture to Mask RCNN == <br />
Region Proposal Network: A Region Proposal Network (RPN) proposes candidate object bounding boxes, which is the first step for effective object detection. It takes an image (of any size) as input and outputs a set of rectangular object boxes, each with an objectness score.<br />
<br />
ROI Pooling: The main use of ROI (Region of Interest) Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section. ROI pooling solves the problem of fixed image size requirement for object detection network. The entire image feeds a CNN model to detect RoI on the feature maps.<br />
<br />
[[File:ROI pooling.png | center | 300px]]<br />
<br />
<br />
Faster R-CNN: Faster R-CNN consists of two stages: Region Proposal Network and Fast R-CNN using ROI Pooling. Region Proposal Network proposes candidate object bounding boxes. ROI Pooling, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale. Other than FPN, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of mask R-CNN is quite similar to the structure of faster R-CNN. <br />
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying the stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI as <math>L=L_{cls}+L_{box}+L_{mask}</math>, where <math>L_{cls}</math>, <math>L_{box}</math>, <math>L_{mask}</math> represent the classification loss, bounding box loss and the average binary cross-entropy loss respectively.<br />
<br />
The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, complete R-CNN pipeline stages involve 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stages 3 and 4 are adjusted to simplify the network procedures.<br />
<br />
The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.<br />
One thing worth noticing is that for other network systems, those masks across classes compete with each other. However, in this particular case with a <br />
per-pixel sigmoid and a binary loss the masks across classes no longer compete, it makes this formula the key for good instance segmentation results.<br />
<br />
=== RoIAlign ===<br />
<br />
This concept is useful in stage 2 where the RoIPool extracts features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to the pixel-to-pixel correspondence. <br />
<br />
The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. <br />
<br />
RoIPool is standard for extracting a small feature map from each RoI. However, it performs quantization before subdividing into spatial bins which are further quantized. Quantization produces misalignments when it comes to predicting pixel accurate masks. Therefore, instead of quantization, the coordinates are computed using bilinear interpolation They use bilinear interpolation to get the exact values of the inputs features at the 4 RoI bins and aggregate the result (using max or average). These results are robust to the sampling location and number of points and to guarantee spatial correspondence.<br />
<br />
The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
Some implementation details should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of the art model <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Human Pose Estimation ==<br />
Mask RCNN can be extended to human pose estimation.<br />
<br />
The simple approach the paper presents is to model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types such as left shoulder, right elbow. The model has minimal knowledge of human pose and this example illustrates the generality of the model.<br />
<br />
[[File:HumanPose.png | center]]<br />
<div align="center">Figure 11: Keypoint Detection Results</div><br />
<br />
== Experiments on Cityscapes ==<br />
The model was also tested on Cityscapes dataset. From this dataset the authors used 2975 annotated images for training, 500 for validation, and 1525 for testing. The instance segmentation task involved eight categories: person, rider, car, truck, bus, train, motorcycle and bicycle. When the Mask R-CNN model was applied to the data it achieved 26.2 AP on the testing data which was an over 30% improvement on the previous best entry. <br />
<br />
<center><br />
[[ File:cityscapeDataset.png ]]<br />
<br />
<br />
Figure 12. Cityscapes Results<br />
</center><br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks the comparisons of different methods and Mask RNN on unlabeled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.<br />
<br />
The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.<br />
<br />
The Mask RCNN could be a candidate model to do short-term predictions on the physical behaviors of a person, which could be very useful at crime scenes.<br />
<br />
For the most part, instance segmentation is now quite achievable, and it’s time to start thinking about innovative ways of using this idea of doing computer vision algorithms at a pixel by pixel level such as the DensePose algorithm. <br />
<br />
An interesting application of Mask RCNN would be on face recognition from CCTVs. Flurry pictures of crowded people could be obtained from CCTV, so that mask RCNN can be applied to distinguish each person.<br />
<br />
The main problem for CNN architectures like Mask RCNN is the running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live stream detections, where a faster running time would mean a better response to changes in frames. It would be beneficial to have a graphical representation of the Mask RCNN running times against single shot detector algorithms such as YOLOv3.<br />
<br />
It is interesting to investigate a solution of embedding instance segmentation with semantic segmentation to improve time performance. Because in many situations, knowing the exact boundary of an object is not necessary.<br />
<br />
<br />
It will be better if we can have more comparisons with other models. It will also be nice if we can have more details about why Mask RCNN can perform better, and how about the efficiency of it?<br />
The authors mentioned that Mask R-CNN is a deep neural network architecture for Instance Segmentation. It's better to include more background information about this task. For example, challenges of this task (e.g. the model will need to take into account the overlapping of objects) and limitations of existing methods.<br />
<br />
It would be interesting to see how a postprocessing step with conditional random fields (CRF) might improve (or not?) segmentation. It would also have been interesting to see the performance of the method with lighter backbones since the backbones used to have a very large inference time which makes them unsuitable for many applications.<br />
<br />
An extension of the application of Mask RCNN in medical AI is to highlight areas of an MRI scan that correlate to certain behavioral/psychological patterns.<br />
<br />
The use of these in medical imaging systems seems rather useful, but it can also be extended to more general CCTV camera systems which can also detect physiological patterns.<br />
<br />
In the Human Pose Estimation section, we assume that Mask RCNN does not have any knowledge of human poses, and all the predictions are based on keypoints on human bodies, for example, left shoulder and right elbow. While in fact we may be able to achieve better performances here because currently this approach is strongly dependent on correct classifications of human body parts. That is, if the model messed up the position of left shoulder, the position estimation will be awful. It is important to remove the dependency on preceding predictions, so that even when previous steps fail, we may still expect a fair performance.<br />
<br />
It will be interesting to see if applying dropout can boost this Mask RCNN architecture's performance.<br />
<br />
It will be interesting if mask RCNN is applied to human faces and how it classify each individual also would be nice to see how the technical calculations such as classification and predictions are done.<br />
<br />
It would be interesting to know how the RCNN model will perform on unbalanced data and how the performance compares with other models in this circumstances.<br />
<br />
The authors omitted the details of the training and the computational cost of training the model. Since RCNN combines stages 3 and 4 (SVM to categorize and bounding box regression), how does this affect the computational cost of the model? Similar architectures to the RCNN have long training times so it is of interest to know the computational runtime of this model in comparison to other models.<br />
<br />
It's amazing what these researchers were able to achieve with adding minimal overhead, and how well it generalizes using two completely different datasets. For the future work it would be nice to see if the model is able to also predict the distance between the objects that overlap in an image, without adding any further significant overhead. <br />
<br />
Additionally, it would be nice to see how well the model is able to predict collision detection between the objects given that it is currently at 5 frames-per-second (which is still really impressive, it just would be interesting to see how much would be possible)<br />
<br />
== Interesting Directions ==<br />
<br />
There is recent work on ResNeSt: Split-Attention Networks (https://arxiv.org/abs/2004.08955), which uses an explicit soft attention mechanism over channels within a ResNeXt style architecture which shows improvements to classification. It would be interesting to use this backbone with Mask R-CNN and see if the attention helps capture longer range dependencies and thus produce better segmentations.<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Loss_Function_Search_for_Face_Recognition&diff=49696Loss Function Search for Face Recognition2020-12-07T03:31:23Z<p>X277lu: /* Introduction */</p>
<hr />
<div>== Presented by ==<br />
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang<br />
<br />
== Introduction ==<br />
Face recognition is a technology that can label a face to a specific identity. The field of study involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face image and another face image map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. A discriminative feature is one that is able to successfully discriminate the labeled data, and is typically a result of feature engineering/selection. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness. Hence, the paper introduced a new loss function using a scale parameter to produce higher gradients to well-separated samples which can reduce the softmax probability. <br />
<br />
Margin-based (angular, additive, additive angular margins) soft-max loss functions are important in learning discriminative features in face recognition. There have been hand-crafted methods previously developed that require much efforts such as A-softmax, V-softmax, AM-Softmax, and Arc-softmax. Li et al. proposed an AutoML for loss function search method also known as AM-LFS from a hyper-parameter optimization perspective [2]. It automatically determines the search space by leveraging reinforcement learning to the search loss functions during the training process, though the drawback is the complex and unstable search space.<br />
<br />
'''Soft Max'''<br />
<br />
Softmax probability is the probability for each class. It contains a vector of values that add up to 1 while ranging between 0 and 1. Cross-entropy loss is the negative value of target values times the log of the probabilities. When softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:<br />
<br />
<center><math>L_1=-\log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k≠y}^K{e^{w^T_yx}}}</math> [1] </center><br />
<br />
<br />
Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:<br />
<br />
<center><math>L_2=-\log\frac{e^{s \cos{(\theta_{{w_y},x})}}}{e^{s \cos{(\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_y},x})}}}}</math> [1] </center><br />
<br />
Where <math> \cos{(\theta_{{w_k},x})} = w^T_y </math> is cosine similarity and <math>\theta_{{w_k},x}</math> is angle between <math> w_k</math> and x. The learnt features with this soft max loss are prone to be separable (as desired).<br />
<br />
'''Margin-based Softmax'''<br />
<br />
This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, the functions are built upon the same structure as the equation above.<br />
<br />
The margin-based softmax function is:<br />
<br />
<center><math>L_3=-\log\frac{e^{s f{(m,\theta_{{w_y},x})}}}{e^{s f{(m,\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_y},x})}}}} </math> </center><br />
<br />
Here, <math>f{(m,\theta_{{w_y},x})} \leq \cos (\theta_{w_y,x})</math> is a carefully chosen margin function.<br />
<br />
Some other variations of chosen functions:<br />
<br />
'''A-Softmax Loss: ''' <math>f{(m_1,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x})</math> , where m1 >= 1 and a integer.<br />
<br />
'''Arc-Softmax Loss: ''' <math>f{(m_1,\theta_{{w_y},x})} = \cos (\theta_{w_y,x} + m_2)</math>, where m2 > 0<br />
<br />
'''AM-Softmax Loss: ''' <math>f{(m,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x} + m_2) - m_3</math>, where m1 >= 1 and a integer; m2,m3 > 0<br />
<br />
<br />
<br />
In this paper, the authors first identified that reducing the softmax probability is a key contribution to feature discrimination and designed two search spaces (random and reward-guided method). They then evaluated their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms using nine popular face recognition benchmarks.<br />
<br />
== Motivation ==<br />
Previous algorithms for facial recognition frequently rely on CNNs that may include metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost for these functions is high. This drawback prompts the redesign of classical softmax loss that cannot discriminate features. Multiple softmax loss functions have since been developed including margin-based formulations which often require fine-tuning of parameters and are susceptible to instability. Therefore, researchers need to put in a lot of effort in creating their method in the large design space. AM-LFS takes an optimization approach for selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.<br />
<br />
To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one required parameter and an improved search space using a reward-based method was determined by the authors to be the best option for their loss function.<br />
<br />
== Problem Formulation ==<br />
=== Analysis of Margin-based Softmax Loss ===<br />
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:<br />
<br />
<center><math>p_m=\frac{1}{ap+(1-a)}*p</math></center><br />
<center> where <math>a=1-e^{s\,{cos{(\theta_{w_y},x)}-f{(m,\theta_{w_y},x)}}}</math> and <math>a≤0</math></center><br />
<br />
<math>a</math> is considered as a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the margin function (<math>f</math>), the minimization of the softmax probability will ensure success.<br />
<br />
Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, versus AM-LFS which has 2M parameters without constraints that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because it could be larger than the softmax probability.<br />
<br />
=== Random Search ===<br />
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:<br />
<br />
<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center><br />
<br />
This encourages the feature margin between different classes and has the capability of feature discrimination. This leads to defining the search space as the choice of <math>h{(a,p)}</math> whose impacts on the training procedure are decided by the modulating factor <math>a</math>. In order to validate the unified formulation, a modulating factor is randomly set at each training epoch. This is noted as Random-Softmax in this paper.<br />
<br />
=== Reward-Guided Search ===<br />
Random search has no guidance for training. To solve this, the authors use reinforcement learning. Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need to have input/output labelled and it does not need a sub-optimal action to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking an action that maximizes the notion of cumulative reward [3]. The process of RL is shown in figure 1. The equation of the cumulative reward function is: <br />
<br />
<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center><br />
<br />
where <math>G_t</math> = cumulative reward, <math>R_t</math> = immediate reward, and <math>R_T</math> = end of episode.<br />
<br />
<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate reward which depends on the agent action and the environment's reaction to this action.<br />
<br />
<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center><br />
<center>Figure 1: Reinforcement Learning scenario [4]</center><br />
<br />
The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5]. <br />
<br />
In this paper, RL is being used to generate a distribution of the hyperparameter <math>\mu</math> for the SoftMax equation using the reward function. At each epoch, <math>B</math> hyper-parameters <math>{a_1, a_2, ..., a_B }</math> are sampled as <math>a \sim \mathcal{N}(\mu, \sigma)</math>. In each epoch, <math>B</math> models are generated with rewards <math>R(a_i), i \in [1, B]</math>. <math>\mu</math> updates after each epoch from the reward function. <br />
<br />
<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center><br />
<br />
Where <math>{g(a_i; \mu, \sigma})</math> is the PDF of a Gaussian distribution. The distributions of <math>{a}</math> are updated and the best model if found from the <math>{B}</math> candidates for the next epoch.<br />
<br />
=== Optimization ===<br />
Calculating the reward involves a standard bi-level optimization problem. A standard bi-level optimization problem is a hierarchy of two optimization tasks, an upper-level or leader and lower-level or follower problems, which involves a hyperparameter ({<math>a_1,a_2,…,a_B</math>}) that can be used for minimizing one objective function while maximizing another objective function simultaneously:<br />
<br />
<center><math>max_a R(a)=r(M_{w^*(a)},S_v)</math></center><br />
<center><math>w^*(a)=_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center><br />
<br />
In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The calculated reward for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) yields the corresponding score, then the algorithm chooses the one with the highest score for model index selection. With the model containing the highest score being used in the next epoch, this process is repeated until the training reaches convergence. In the end, the algorithm takes the model with the highest score without retraining.<br />
<br />
== Results and Discussion ==<br />
=== Data Preprocessing ===<br />
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets.<br />
Furthermore, it is important to perform open-set evaluation for face recognition problems. That is, there shall be no overlapping identities between training and testing sets. As a result, there were a total of 15,414 identities removed from the testing sets. For fairness during comparison, all summarized results will be based on refined datasets.<br />
<br />
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===<br />
For LFW, there is not a noticeable difference between the algorithms proposed in this paper and the other algorithms, however, AM-Softmax achieved higher results than Search-Softmax. Random-Softmax achieved the highest results by 0.03%.<br />
<br />
Random-Softmax outperforms baseline Soft-max and is comparable to most of the margin-based softmax. Search-Softmax boosts the performance and better most methods specifically when training CASIA-WebFace-R data set, it achieves 0.72% average improvement over AM-Softmax. The reason the model proposed by the paper gives better results is because of their optimization strategy which helps boost the discrimination power. Also, the sampled candidate from the paper’s proposed search space can well approximate the margin-based loss functions. More tests need to happen to more complicated protocols to test the performance further. Not a lot of improvement has been shown on those test sets, since they are relatively simple and the performance of all the methods on these test sets are near saturation. The following table gives a summary of the performance of each model.<br />
<br />
<center>Table 1.Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<br />
<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on RFW ===<br />
The RFW dataset measures racial bias which consists of Caucasian, Indian, Asian, and African. Using this as the test set, Random-softmax and Search-softmax performed better than the other methods. Random-softmax outperforms the baseline softmax by a large margin which means reducing the softmax probability will enhance the feature discrimination for face recognition. It is also observed that the reward guided search-softmax method is more likely to enhance the discriminative feature learning resulting in higher performance as shown in Table 2 and Table 3. <br />
<br />
<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on MegaFace and Trillion-Pairs ===<br />
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 and the verification (Veri.) with the true positive rate (TPR) at low false acceptance rate (FAR) at <math>1e^{-3}</math> on MegaFace, the identification TPR@FAR = <math>1e^{-6}</math> and the verification TPR@FAR = <math>1e^{-9}</math> on Trillion-Pairs are reported on Table 4 and 5.<br />
<br />
On the test sets MegaFace and Trillion-Pairs, Search-Softmax achieves the best performance over all other alternative methods. On MegaFace, Search-Softmax beat the best competitor AM-softmax by a large margin. It also outperformed AM-LFS due to newly designed search space. <br />
<br />
<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center><br />
<br />
From the CMC curves and ROC curves in Figure 2, similar trends are observed at other measures. There is a similar trend with Trillion-Pairs where Search-Softmax loss is found to be superior with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R at both the identification and verification. Based on these experiments, Search-Softmax loss can perform well, especially with a low false positive rate and it shows a strong generalization ability for face recognition.<br />
<br />
<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center><br />
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center><br />
<br />
== Conclusion ==<br />
The paper discussed that in order to enhance feature discrimination for face recognition, it is crucial to reduce the softmax probability. To achieve this goal, unified formulation for the margin-based softmax losses is designed. Two search methods have been developed using a random and a reward-guided loss function and they were validated to be effective over six other methods using nine different test data sets. While these developed methods were generally more effective in increasing accuracy versus previous methods, there is very little difference between the two. It can be seen that Search-Softmax performs slightly better than Random-Softmax most of the time.<br />
<br />
== Critiques ==<br />
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.<br />
* Datasets used did require some preprocessing, which may have improved the results beyond what the method otherwise would.<br />
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.<br />
* The test data set they used to test Search-Softmax and Random-Softmax are simple and they saturate in other methods. So the results of their methods didn’t show many advantages since they produce very similar results. A more complicated data set needs to be tested to prove the method's reliability.<br />
* There is another paper Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf] that provides a more detailed explanation about how to reduce margin-based softmax loss.<br />
* It is questionable when it comes to the accuracy of testing sets, as they only used the clean version of CASIA-WebFace and MS-Celeb-1M-vlc for training instead of these two training sets with noisy labels.<br />
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], written by Tae-Hyun Oh et al., they also discuss an optimal loss function for face recognition. However, since in the other paper, they were doing face recognition from voice audio, the loss function used was slightly different than the ones discussed in this paper.<br />
* This model has many applications such as identifying disguised prisoners for police. But we need to do a good data preprocessing otherwise we might not get a good predicted result. But authors did not mention about the data preprocessing which is a key part of this model. It might be done in either feature extraction or feature selection, but it's crucial to indicate how they did it.<br />
* It will be better if we can know what kind of noises was removed in the clean version. Also, simply removing the overlapping data is wasteful. It would be better to just put them into one of the train and test samples.<br />
* This paper indicate that the new searching method and loss function have induced more effective face recognition result than other six methods. But there is no mention of the increase or decrease in computational efficiency since only very little difference exist between those methods and the real time evaluation is often required at the face recognition application level.<br />
* There are some loss functions that receives more than 2 inputs. For example, the ''triplet loss'' function, developed by Google, takes 3 inputs: positive input, negative input and anchor input. This makes sense because for face recognition, we want to model to learn not only what it is supposed to predict but also what it is not supposed to predict. Typically, triplet loss handles false positives much better. This paper can extend its scope to such loss function that takes more than 2 inputs.<br />
* It would be good to also know what the training time is like for the method, specifically the "Reward-Guided Search" which uses RL. Also the authors mention some data preprocessing that was performed, was this same preprocessing also performed for the methods they compared against?<br />
* Sections on Data Processing and Results can be improved. About the datasets, I have some questions about why they are divided in the current fashion. It is mentioned that "CASIA-WebFace and MS-Celeb-1M-v1c" are used as training datasets. But the comparison of algorithms are divided into three groups: Megaface and TrillionPairs, RFW, and a group of other datasets. In general, when we are comparing algorithms, we want to have a holistic view of how each algorithm compare. So I have some concerns about dividing the results into three section. More explanation can be provided. It also seems like Random-Softmax and Search Softmax outperform all other algorithms across all datasets. So it would make even more sense to have a big table including all the results. About data preprocessing, I believe that giving more information about which noisy data are removed would be nice.<br />
* Despite thorough comparison between each method against the proposed method, it does not give a reason to why it was the case that it was either better or worse, and it does not necessarily need to be a mathematical explanation but an intuitive one to demonstrate how it can be replicated and whether the results require a certain condition to achieve. <br />
* Though we have a graph demonstrating the training loss with Random-Softmax and Search-Softmax with regards to the number of Epochs as an independent variable which we may deduce the number of epochs used in later graphs but since one of the main features is that "Meanwhile, our optimization strategy enables that the dynamic loss can guide<br />
* Did the paper address why the average model performs worse on African faces, would it be a lack of data points?<br />
the model training of different epochs, which helps further boost the discrimination power." it is imperative that the results are comparable along the same scale (for example, for 20 epochs, then take the average of the losses).<br />
* The result summary is overwhelming with numbers and representation of result is lacking. It would be great if the result can be explained. Introduction of model and its component is lacking and could be explained more.<br />
* It would be better if the paper contains some Face Recognition visualization, i.e. show actually face recognition example to show the improvement.<br />
* The introduction of data and the analysis of data processing are important because there might be some limitations. Also, it would be better to give theoretical analysis of the effects of reducing softmax probability and the number of sampled models, which explains the update of the parameters for better performance.<br />
* It would be better to include time performance in the evaluation section.<br />
* The paper is missing details on datasets. It would be better to know if the datasets were balanced or unbalanced and how this would affect the accuracy. Also, computational comparisons between the new loss function versus traditional method would be interesting to know.<br />
* The paper included a dataset that measures racial bias, however it is a widely known fact that majority of face recognition models are trained on biased and imbalanced datasets themselves. For example, AI that has bias towards classifying a black person as a prisoner since the training set of prisoners is predominantly black. A question that remains unanswered is how training a model using the proposed loss function helps to combat racial bias in machine learning, and how these results in particular improved (or worsened) with its use.<br />
* There are too much data in the conclusion part. A brief conclusion based on several sentences should be enough to present the ideas.<br />
* The author could add the time efficiency of fave recognition in the result to compare the models with other current models for facial recognition since nowadays many application that uses face recognition rely on fast recognition(e.g. unlock phone with face id)<br />
*It is interesting to see how loss function impacts face recognition. It would be better to see different evaluations based on different datasets. Not only accuracy is important, but also efficiency is significant.<br />
* It is important for facial recognition models to feature some level of noise, since for most uses, things like lighting and camera quality is drastically different. For a more practical use of this, datasets with a certain level of noise should be introduced and tested.<br />
* The paper omits the full details of the necessity of normalization within the model; as described in the original [https://arxiv.org/abs/1801.05599 paper] describing A-softmax for face verification they stressed the importance of normalization. Does the lack of detail mean these new softmax loss functions don't require layer normalization?<br />
<br />
== References ==<br />
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.<br />
<br />
[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.<br />
2020].<br />
<br />
[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020]. <br />
<br />
[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].<br />
<br />
[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Loss_Function_Search_for_Face_Recognition&diff=49695Loss Function Search for Face Recognition2020-12-07T03:30:32Z<p>X277lu: /* Introduction */</p>
<hr />
<div>== Presented by ==<br />
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang<br />
<br />
== Introduction ==<br />
Face recognition is a technology that can label a face to a specific identity. The field of study involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face image and another face image map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. A discriminative feature is one that is able to successfully discriminate the labeled data, and is typically a result of feature engineering/selection. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness. Hence, the paper introduced a new loss function using a scale parameter to produce higher gradients to well-separated samples which can reduce the softmax probability. <br />
<br />
Margin-based (angular, additive, additive angular margins) soft-max loss functions are important in learning discriminative features in face recognition. There have been hand-crafted methods previously developed that require much efforts such as A-softmax, V-softmax, AM-Softmax, and Arc-softmax. Li et al. proposed an AutoML for loss function search method also known as AM-LFS from a hyper-parameter optimization perspective [2]. It automatically determines the search space by leveraging reinforcement learning to the search loss functions during the training process, though the drawback is the complex and unstable search space.<br />
<br />
'''Soft Max'''<br />
<br />
Softmax probability is the probability for each class. It contains a vector of values that add up to 1 while ranging between 0 and 1. Cross-entropy loss is the negative value of target values times the log of the probabilities. When softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:<br />
<br />
<center><math>L_1=-\log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k≠y}^K{e^{w^T_yx}}}</math> [1] </center><br />
<br />
<br />
Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:<br />
<br />
<center><math>L_2=-\log\frac{e^{s \cos{(\theta_{{w_y},x})}}}{e^{s \cos{(\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_y},x})}}}}</math> [1] </center><br />
<br />
Where <math> \cos{(\theta_{{w_k},x})} = w^T_y </math> is cosine similarity and <math>\theta_{{w_k},x}</math> is angle between <math> w_k</math> and x. The learnt features with this soft max loss are prone to be separable (as desired).<br />
<br />
'''Margin-based Softmax'''<br />
<br />
This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above.<br />
<br />
The margin-based softmax function is:<br />
<br />
<center><math>L_3=-\log\frac{e^{s f{(m,\theta_{{w_y},x})}}}{e^{s f{(m,\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_y},x})}}}} </math> </center><br />
<br />
Here, <math>f{(m,\theta_{{w_y},x})} \leq \cos (\theta_{w_y,x})</math> is a carefully chosen margin function.<br />
<br />
Some other variations of chosen functions:<br />
<br />
'''A-Softmax Loss: ''' <math>f{(m_1,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x})</math> , where m1 >= 1 and a integer.<br />
<br />
'''Arc-Softmax Loss: ''' <math>f{(m_1,\theta_{{w_y},x})} = \cos (\theta_{w_y,x} + m_2)</math>, where m2 > 0<br />
<br />
'''AM-Softmax Loss: ''' <math>f{(m,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x} + m_2) - m_3</math>, where m1 >= 1 and a integer; m2,m3 > 0<br />
<br />
<br />
<br />
In this paper, the authors first identified that reducing the softmax probability is a key contribution to feature discrimination and designed two search spaces (random and reward-guided method). They then evaluated their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms using nine popular face recognition benchmarks.<br />
<br />
== Motivation ==<br />
Previous algorithms for facial recognition frequently rely on CNNs that may include metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost for these functions is high. This drawback prompts the redesign of classical softmax loss that cannot discriminate features. Multiple softmax loss functions have since been developed including margin-based formulations which often require fine-tuning of parameters and are susceptible to instability. Therefore, researchers need to put in a lot of effort in creating their method in the large design space. AM-LFS takes an optimization approach for selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.<br />
<br />
To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one required parameter and an improved search space using a reward-based method was determined by the authors to be the best option for their loss function.<br />
<br />
== Problem Formulation ==<br />
=== Analysis of Margin-based Softmax Loss ===<br />
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:<br />
<br />
<center><math>p_m=\frac{1}{ap+(1-a)}*p</math></center><br />
<center> where <math>a=1-e^{s\,{cos{(\theta_{w_y},x)}-f{(m,\theta_{w_y},x)}}}</math> and <math>a≤0</math></center><br />
<br />
<math>a</math> is considered as a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the margin function (<math>f</math>), the minimization of the softmax probability will ensure success.<br />
<br />
Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, versus AM-LFS which has 2M parameters without constraints that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because it could be larger than the softmax probability.<br />
<br />
=== Random Search ===<br />
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:<br />
<br />
<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center><br />
<br />
This encourages the feature margin between different classes and has the capability of feature discrimination. This leads to defining the search space as the choice of <math>h{(a,p)}</math> whose impacts on the training procedure are decided by the modulating factor <math>a</math>. In order to validate the unified formulation, a modulating factor is randomly set at each training epoch. This is noted as Random-Softmax in this paper.<br />
<br />
=== Reward-Guided Search ===<br />
Random search has no guidance for training. To solve this, the authors use reinforcement learning. Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need to have input/output labelled and it does not need a sub-optimal action to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking an action that maximizes the notion of cumulative reward [3]. The process of RL is shown in figure 1. The equation of the cumulative reward function is: <br />
<br />
<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center><br />
<br />
where <math>G_t</math> = cumulative reward, <math>R_t</math> = immediate reward, and <math>R_T</math> = end of episode.<br />
<br />
<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate reward which depends on the agent action and the environment's reaction to this action.<br />
<br />
<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center><br />
<center>Figure 1: Reinforcement Learning scenario [4]</center><br />
<br />
The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5]. <br />
<br />
In this paper, RL is being used to generate a distribution of the hyperparameter <math>\mu</math> for the SoftMax equation using the reward function. At each epoch, <math>B</math> hyper-parameters <math>{a_1, a_2, ..., a_B }</math> are sampled as <math>a \sim \mathcal{N}(\mu, \sigma)</math>. In each epoch, <math>B</math> models are generated with rewards <math>R(a_i), i \in [1, B]</math>. <math>\mu</math> updates after each epoch from the reward function. <br />
<br />
<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center><br />
<br />
Where <math>{g(a_i; \mu, \sigma})</math> is the PDF of a Gaussian distribution. The distributions of <math>{a}</math> are updated and the best model if found from the <math>{B}</math> candidates for the next epoch.<br />
<br />
=== Optimization ===<br />
Calculating the reward involves a standard bi-level optimization problem. A standard bi-level optimization problem is a hierarchy of two optimization tasks, an upper-level or leader and lower-level or follower problems, which involves a hyperparameter ({<math>a_1,a_2,…,a_B</math>}) that can be used for minimizing one objective function while maximizing another objective function simultaneously:<br />
<br />
<center><math>max_a R(a)=r(M_{w^*(a)},S_v)</math></center><br />
<center><math>w^*(a)=_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center><br />
<br />
In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The calculated reward for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) yields the corresponding score, then the algorithm chooses the one with the highest score for model index selection. With the model containing the highest score being used in the next epoch, this process is repeated until the training reaches convergence. In the end, the algorithm takes the model with the highest score without retraining.<br />
<br />
== Results and Discussion ==<br />
=== Data Preprocessing ===<br />
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets.<br />
Furthermore, it is important to perform open-set evaluation for face recognition problems. That is, there shall be no overlapping identities between training and testing sets. As a result, there were a total of 15,414 identities removed from the testing sets. For fairness during comparison, all summarized results will be based on refined datasets.<br />
<br />
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===<br />
For LFW, there is not a noticeable difference between the algorithms proposed in this paper and the other algorithms, however, AM-Softmax achieved higher results than Search-Softmax. Random-Softmax achieved the highest results by 0.03%.<br />
<br />
Random-Softmax outperforms baseline Soft-max and is comparable to most of the margin-based softmax. Search-Softmax boosts the performance and better most methods specifically when training CASIA-WebFace-R data set, it achieves 0.72% average improvement over AM-Softmax. The reason the model proposed by the paper gives better results is because of their optimization strategy which helps boost the discrimination power. Also, the sampled candidate from the paper’s proposed search space can well approximate the margin-based loss functions. More tests need to happen to more complicated protocols to test the performance further. Not a lot of improvement has been shown on those test sets, since they are relatively simple and the performance of all the methods on these test sets are near saturation. The following table gives a summary of the performance of each model.<br />
<br />
<center>Table 1.Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<br />
<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on RFW ===<br />
The RFW dataset measures racial bias which consists of Caucasian, Indian, Asian, and African. Using this as the test set, Random-softmax and Search-softmax performed better than the other methods. Random-softmax outperforms the baseline softmax by a large margin which means reducing the softmax probability will enhance the feature discrimination for face recognition. It is also observed that the reward guided search-softmax method is more likely to enhance the discriminative feature learning resulting in higher performance as shown in Table 2 and Table 3. <br />
<br />
<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on MegaFace and Trillion-Pairs ===<br />
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 and the verification (Veri.) with the true positive rate (TPR) at low false acceptance rate (FAR) at <math>1e^{-3}</math> on MegaFace, the identification TPR@FAR = <math>1e^{-6}</math> and the verification TPR@FAR = <math>1e^{-9}</math> on Trillion-Pairs are reported on Table 4 and 5.<br />
<br />
On the test sets MegaFace and Trillion-Pairs, Search-Softmax achieves the best performance over all other alternative methods. On MegaFace, Search-Softmax beat the best competitor AM-softmax by a large margin. It also outperformed AM-LFS due to newly designed search space. <br />
<br />
<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center><br />
<br />
From the CMC curves and ROC curves in Figure 2, similar trends are observed at other measures. There is a similar trend with Trillion-Pairs where Search-Softmax loss is found to be superior with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R at both the identification and verification. Based on these experiments, Search-Softmax loss can perform well, especially with a low false positive rate and it shows a strong generalization ability for face recognition.<br />
<br />
<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center><br />
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center><br />
<br />
== Conclusion ==<br />
The paper discussed that in order to enhance feature discrimination for face recognition, it is crucial to reduce the softmax probability. To achieve this goal, unified formulation for the margin-based softmax losses is designed. Two search methods have been developed using a random and a reward-guided loss function and they were validated to be effective over six other methods using nine different test data sets. While these developed methods were generally more effective in increasing accuracy versus previous methods, there is very little difference between the two. It can be seen that Search-Softmax performs slightly better than Random-Softmax most of the time.<br />
<br />
== Critiques ==<br />
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.<br />
* Datasets used did require some preprocessing, which may have improved the results beyond what the method otherwise would.<br />
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.<br />
* The test data set they used to test Search-Softmax and Random-Softmax are simple and they saturate in other methods. So the results of their methods didn’t show many advantages since they produce very similar results. A more complicated data set needs to be tested to prove the method's reliability.<br />
* There is another paper Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf] that provides a more detailed explanation about how to reduce margin-based softmax loss.<br />
* It is questionable when it comes to the accuracy of testing sets, as they only used the clean version of CASIA-WebFace and MS-Celeb-1M-vlc for training instead of these two training sets with noisy labels.<br />
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], written by Tae-Hyun Oh et al., they also discuss an optimal loss function for face recognition. However, since in the other paper, they were doing face recognition from voice audio, the loss function used was slightly different than the ones discussed in this paper.<br />
* This model has many applications such as identifying disguised prisoners for police. But we need to do a good data preprocessing otherwise we might not get a good predicted result. But authors did not mention about the data preprocessing which is a key part of this model. It might be done in either feature extraction or feature selection, but it's crucial to indicate how they did it.<br />
* It will be better if we can know what kind of noises was removed in the clean version. Also, simply removing the overlapping data is wasteful. It would be better to just put them into one of the train and test samples.<br />
* This paper indicate that the new searching method and loss function have induced more effective face recognition result than other six methods. But there is no mention of the increase or decrease in computational efficiency since only very little difference exist between those methods and the real time evaluation is often required at the face recognition application level.<br />
* There are some loss functions that receives more than 2 inputs. For example, the ''triplet loss'' function, developed by Google, takes 3 inputs: positive input, negative input and anchor input. This makes sense because for face recognition, we want to model to learn not only what it is supposed to predict but also what it is not supposed to predict. Typically, triplet loss handles false positives much better. This paper can extend its scope to such loss function that takes more than 2 inputs.<br />
* It would be good to also know what the training time is like for the method, specifically the "Reward-Guided Search" which uses RL. Also the authors mention some data preprocessing that was performed, was this same preprocessing also performed for the methods they compared against?<br />
* Sections on Data Processing and Results can be improved. About the datasets, I have some questions about why they are divided in the current fashion. It is mentioned that "CASIA-WebFace and MS-Celeb-1M-v1c" are used as training datasets. But the comparison of algorithms are divided into three groups: Megaface and TrillionPairs, RFW, and a group of other datasets. In general, when we are comparing algorithms, we want to have a holistic view of how each algorithm compare. So I have some concerns about dividing the results into three section. More explanation can be provided. It also seems like Random-Softmax and Search Softmax outperform all other algorithms across all datasets. So it would make even more sense to have a big table including all the results. About data preprocessing, I believe that giving more information about which noisy data are removed would be nice.<br />
* Despite thorough comparison between each method against the proposed method, it does not give a reason to why it was the case that it was either better or worse, and it does not necessarily need to be a mathematical explanation but an intuitive one to demonstrate how it can be replicated and whether the results require a certain condition to achieve. <br />
* Though we have a graph demonstrating the training loss with Random-Softmax and Search-Softmax with regards to the number of Epochs as an independent variable which we may deduce the number of epochs used in later graphs but since one of the main features is that "Meanwhile, our optimization strategy enables that the dynamic loss can guide<br />
* Did the paper address why the average model performs worse on African faces, would it be a lack of data points?<br />
the model training of different epochs, which helps further boost the discrimination power." it is imperative that the results are comparable along the same scale (for example, for 20 epochs, then take the average of the losses).<br />
* The result summary is overwhelming with numbers and representation of result is lacking. It would be great if the result can be explained. Introduction of model and its component is lacking and could be explained more.<br />
* It would be better if the paper contains some Face Recognition visualization, i.e. show actually face recognition example to show the improvement.<br />
* The introduction of data and the analysis of data processing are important because there might be some limitations. Also, it would be better to give theoretical analysis of the effects of reducing softmax probability and the number of sampled models, which explains the update of the parameters for better performance.<br />
* It would be better to include time performance in the evaluation section.<br />
* The paper is missing details on datasets. It would be better to know if the datasets were balanced or unbalanced and how this would affect the accuracy. Also, computational comparisons between the new loss function versus traditional method would be interesting to know.<br />
* The paper included a dataset that measures racial bias, however it is a widely known fact that majority of face recognition models are trained on biased and imbalanced datasets themselves. For example, AI that has bias towards classifying a black person as a prisoner since the training set of prisoners is predominantly black. A question that remains unanswered is how training a model using the proposed loss function helps to combat racial bias in machine learning, and how these results in particular improved (or worsened) with its use.<br />
* There are too much data in the conclusion part. A brief conclusion based on several sentences should be enough to present the ideas.<br />
* The author could add the time efficiency of fave recognition in the result to compare the models with other current models for facial recognition since nowadays many application that uses face recognition rely on fast recognition(e.g. unlock phone with face id)<br />
*It is interesting to see how loss function impacts face recognition. It would be better to see different evaluations based on different datasets. Not only accuracy is important, but also efficiency is significant.<br />
* It is important for facial recognition models to feature some level of noise, since for most uses, things like lighting and camera quality is drastically different. For a more practical use of this, datasets with a certain level of noise should be introduced and tested.<br />
* The paper omits the full details of the necessity of normalization within the model; as described in the original [https://arxiv.org/abs/1801.05599 paper] describing A-softmax for face verification they stressed the importance of normalization. Does the lack of detail mean these new softmax loss functions don't require layer normalization?<br />
<br />
== References ==<br />
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.<br />
<br />
[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.<br />
2020].<br />
<br />
[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020]. <br />
<br />
[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].<br />
<br />
[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Loss_Function_Search_for_Face_Recognition&diff=49694Loss Function Search for Face Recognition2020-12-07T03:30:14Z<p>X277lu: /* Introduction */</p>
<hr />
<div>== Presented by ==<br />
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang<br />
<br />
== Introduction ==<br />
Face recognition is a technology that can label a face to a specific identity. The field of study involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face image and another face image map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. A discriminative feature is one that is able to successfully discriminate the labeled data, and is typically a result of feature engineering/selection. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness. Hence, the paper introduced a new loss function using a scale parameter to produce higher gradients to well-separated samples which can reduce the softmax probability. <br />
<br />
Margin-based (angular, additive, additive angular margins) soft-max loss functions are important in learning discriminative features in face recognition. There have been hand-crafted methods previously developed that require much efforts such as A-softmax, V-softmax, AM-Softmax, and Arc-softmax. Li et al. proposed an AutoML for loss function search method also known as AM-LFS from a hyper-parameter optimization perspective [2]. It automatically determines the search space by leveraging reinforcement learning to the search loss functions during the training process, though the drawback is the complex and unstable search space.<br />
<br />
'''Soft Max'''<br />
<br />
Softmax probability is the probability for each class. It contains a vector of values that add up to 1 while ranging between 0 and 1. Cross-entropy loss is the negative value of target values times the log of the probabilities. When softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:<br />
<br />
<center><math>L_1=-\log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k≠y}^K{e^{w^T_yx}}}</math> [1] </center><br />
<br />
<br />
Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:<br />
<br />
<center><math>L_2=-\log\frac{e^{s \cos{(\theta_{{w_y},x})}}}{e^{s \cos{(\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_y},x})}}}}</math> [1] </center><br />
<br />
Where <math> \cos{(\theta_{{w_k},x})} = w^T_y </math> is cosine similarity and <math>\theta_{{w_k},x}</math> is angle between <math> w_k</math> and x. The learnt features with this soft max loss are prone to be separable (as desired).<br />
<br />
'''Margin-based Softmax'''<br />
<br />
This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above.<br />
<br />
The margin-based softmax function is:<br />
<br />
<center><math>L_3=-\log\frac{e^{s f{(m,\theta_{{w_y},x})}}}{e^{s f{(m,\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_y},x})}}}} </math> </center><br />
<br />
Here, <math>f{(m,\theta_{{w_y},x})} \leq \cos (\theta_{w_y,x})</math> is a carefully chosen margin function.<br />
<br />
Some other variations of chosen functions:<br />
<br />
'''A-Softmax Loss: '''<math>f{(m_1,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x})</math> , where m1 >= 1 and a integer.<br />
<br />
'''Arc-Softmax Loss: '''<math>f{(m_1,\theta_{{w_y},x})} = \cos (\theta_{w_y,x} + m_2)</math>, where m2 > 0<br />
<br />
'''AM-Softmax Loss: '''<math>f{(m,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x} + m_2) - m_3</math>, where m1 >= 1 and a integer; m2,m3 > 0<br />
<br />
<br />
<br />
In this paper, the authors first identified that reducing the softmax probability is a key contribution to feature discrimination and designed two search spaces (random and reward-guided method). They then evaluated their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms using nine popular face recognition benchmarks.<br />
<br />
== Motivation ==<br />
Previous algorithms for facial recognition frequently rely on CNNs that may include metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost for these functions is high. This drawback prompts the redesign of classical softmax loss that cannot discriminate features. Multiple softmax loss functions have since been developed including margin-based formulations which often require fine-tuning of parameters and are susceptible to instability. Therefore, researchers need to put in a lot of effort in creating their method in the large design space. AM-LFS takes an optimization approach for selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.<br />
<br />
To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one required parameter and an improved search space using a reward-based method was determined by the authors to be the best option for their loss function.<br />
<br />
== Problem Formulation ==<br />
=== Analysis of Margin-based Softmax Loss ===<br />
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:<br />
<br />
<center><math>p_m=\frac{1}{ap+(1-a)}*p</math></center><br />
<center> where <math>a=1-e^{s\,{cos{(\theta_{w_y},x)}-f{(m,\theta_{w_y},x)}}}</math> and <math>a≤0</math></center><br />
<br />
<math>a</math> is considered as a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the margin function (<math>f</math>), the minimization of the softmax probability will ensure success.<br />
<br />
Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, versus AM-LFS which has 2M parameters without constraints that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because it could be larger than the softmax probability.<br />
<br />
=== Random Search ===<br />
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:<br />
<br />
<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center><br />
<br />
This encourages the feature margin between different classes and has the capability of feature discrimination. This leads to defining the search space as the choice of <math>h{(a,p)}</math> whose impacts on the training procedure are decided by the modulating factor <math>a</math>. In order to validate the unified formulation, a modulating factor is randomly set at each training epoch. This is noted as Random-Softmax in this paper.<br />
<br />
=== Reward-Guided Search ===<br />
Random search has no guidance for training. To solve this, the authors use reinforcement learning. Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need to have input/output labelled and it does not need a sub-optimal action to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking an action that maximizes the notion of cumulative reward [3]. The process of RL is shown in figure 1. The equation of the cumulative reward function is: <br />
<br />
<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center><br />
<br />
where <math>G_t</math> = cumulative reward, <math>R_t</math> = immediate reward, and <math>R_T</math> = end of episode.<br />
<br />
<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate reward which depends on the agent action and the environment's reaction to this action.<br />
<br />
<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center><br />
<center>Figure 1: Reinforcement Learning scenario [4]</center><br />
<br />
The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5]. <br />
<br />
In this paper, RL is being used to generate a distribution of the hyperparameter <math>\mu</math> for the SoftMax equation using the reward function. At each epoch, <math>B</math> hyper-parameters <math>{a_1, a_2, ..., a_B }</math> are sampled as <math>a \sim \mathcal{N}(\mu, \sigma)</math>. In each epoch, <math>B</math> models are generated with rewards <math>R(a_i), i \in [1, B]</math>. <math>\mu</math> updates after each epoch from the reward function. <br />
<br />
<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center><br />
<br />
Where <math>{g(a_i; \mu, \sigma})</math> is the PDF of a Gaussian distribution. The distributions of <math>{a}</math> are updated and the best model if found from the <math>{B}</math> candidates for the next epoch.<br />
<br />
=== Optimization ===<br />
Calculating the reward involves a standard bi-level optimization problem. A standard bi-level optimization problem is a hierarchy of two optimization tasks, an upper-level or leader and lower-level or follower problems, which involves a hyperparameter ({<math>a_1,a_2,…,a_B</math>}) that can be used for minimizing one objective function while maximizing another objective function simultaneously:<br />
<br />
<center><math>max_a R(a)=r(M_{w^*(a)},S_v)</math></center><br />
<center><math>w^*(a)=_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center><br />
<br />
In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The calculated reward for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) yields the corresponding score, then the algorithm chooses the one with the highest score for model index selection. With the model containing the highest score being used in the next epoch, this process is repeated until the training reaches convergence. In the end, the algorithm takes the model with the highest score without retraining.<br />
<br />
== Results and Discussion ==<br />
=== Data Preprocessing ===<br />
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets.<br />
Furthermore, it is important to perform open-set evaluation for face recognition problems. That is, there shall be no overlapping identities between training and testing sets. As a result, there were a total of 15,414 identities removed from the testing sets. For fairness during comparison, all summarized results will be based on refined datasets.<br />
<br />
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===<br />
For LFW, there is not a noticeable difference between the algorithms proposed in this paper and the other algorithms, however, AM-Softmax achieved higher results than Search-Softmax. Random-Softmax achieved the highest results by 0.03%.<br />
<br />
Random-Softmax outperforms baseline Soft-max and is comparable to most of the margin-based softmax. Search-Softmax boosts the performance and better most methods specifically when training CASIA-WebFace-R data set, it achieves 0.72% average improvement over AM-Softmax. The reason the model proposed by the paper gives better results is because of their optimization strategy which helps boost the discrimination power. Also, the sampled candidate from the paper’s proposed search space can well approximate the margin-based loss functions. More tests need to happen to more complicated protocols to test the performance further. Not a lot of improvement has been shown on those test sets, since they are relatively simple and the performance of all the methods on these test sets are near saturation. The following table gives a summary of the performance of each model.<br />
<br />
<center>Table 1.Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<br />
<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on RFW ===<br />
The RFW dataset measures racial bias which consists of Caucasian, Indian, Asian, and African. Using this as the test set, Random-softmax and Search-softmax performed better than the other methods. Random-softmax outperforms the baseline softmax by a large margin which means reducing the softmax probability will enhance the feature discrimination for face recognition. It is also observed that the reward guided search-softmax method is more likely to enhance the discriminative feature learning resulting in higher performance as shown in Table 2 and Table 3. <br />
<br />
<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on MegaFace and Trillion-Pairs ===<br />
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 and the verification (Veri.) with the true positive rate (TPR) at low false acceptance rate (FAR) at <math>1e^{-3}</math> on MegaFace, the identification TPR@FAR = <math>1e^{-6}</math> and the verification TPR@FAR = <math>1e^{-9}</math> on Trillion-Pairs are reported on Table 4 and 5.<br />
<br />
On the test sets MegaFace and Trillion-Pairs, Search-Softmax achieves the best performance over all other alternative methods. On MegaFace, Search-Softmax beat the best competitor AM-softmax by a large margin. It also outperformed AM-LFS due to newly designed search space. <br />
<br />
<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center><br />
<br />
From the CMC curves and ROC curves in Figure 2, similar trends are observed at other measures. There is a similar trend with Trillion-Pairs where Search-Softmax loss is found to be superior with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R at both the identification and verification. Based on these experiments, Search-Softmax loss can perform well, especially with a low false positive rate and it shows a strong generalization ability for face recognition.<br />
<br />
<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center><br />
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center><br />
<br />
== Conclusion ==<br />
The paper discussed that in order to enhance feature discrimination for face recognition, it is crucial to reduce the softmax probability. To achieve this goal, unified formulation for the margin-based softmax losses is designed. Two search methods have been developed using a random and a reward-guided loss function and they were validated to be effective over six other methods using nine different test data sets. While these developed methods were generally more effective in increasing accuracy versus previous methods, there is very little difference between the two. It can be seen that Search-Softmax performs slightly better than Random-Softmax most of the time.<br />
<br />
== Critiques ==<br />
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.<br />
* Datasets used did require some preprocessing, which may have improved the results beyond what the method otherwise would.<br />
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.<br />
* The test data set they used to test Search-Softmax and Random-Softmax are simple and they saturate in other methods. So the results of their methods didn’t show many advantages since they produce very similar results. A more complicated data set needs to be tested to prove the method's reliability.<br />
* There is another paper Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf] that provides a more detailed explanation about how to reduce margin-based softmax loss.<br />
* It is questionable when it comes to the accuracy of testing sets, as they only used the clean version of CASIA-WebFace and MS-Celeb-1M-vlc for training instead of these two training sets with noisy labels.<br />
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], written by Tae-Hyun Oh et al., they also discuss an optimal loss function for face recognition. However, since in the other paper, they were doing face recognition from voice audio, the loss function used was slightly different than the ones discussed in this paper.<br />
* This model has many applications such as identifying disguised prisoners for police. But we need to do a good data preprocessing otherwise we might not get a good predicted result. But authors did not mention about the data preprocessing which is a key part of this model. It might be done in either feature extraction or feature selection, but it's crucial to indicate how they did it.<br />
* It will be better if we can know what kind of noises was removed in the clean version. Also, simply removing the overlapping data is wasteful. It would be better to just put them into one of the train and test samples.<br />
* This paper indicate that the new searching method and loss function have induced more effective face recognition result than other six methods. But there is no mention of the increase or decrease in computational efficiency since only very little difference exist between those methods and the real time evaluation is often required at the face recognition application level.<br />
* There are some loss functions that receives more than 2 inputs. For example, the ''triplet loss'' function, developed by Google, takes 3 inputs: positive input, negative input and anchor input. This makes sense because for face recognition, we want to model to learn not only what it is supposed to predict but also what it is not supposed to predict. Typically, triplet loss handles false positives much better. This paper can extend its scope to such loss function that takes more than 2 inputs.<br />
* It would be good to also know what the training time is like for the method, specifically the "Reward-Guided Search" which uses RL. Also the authors mention some data preprocessing that was performed, was this same preprocessing also performed for the methods they compared against?<br />
* Sections on Data Processing and Results can be improved. About the datasets, I have some questions about why they are divided in the current fashion. It is mentioned that "CASIA-WebFace and MS-Celeb-1M-v1c" are used as training datasets. But the comparison of algorithms are divided into three groups: Megaface and TrillionPairs, RFW, and a group of other datasets. In general, when we are comparing algorithms, we want to have a holistic view of how each algorithm compare. So I have some concerns about dividing the results into three section. More explanation can be provided. It also seems like Random-Softmax and Search Softmax outperform all other algorithms across all datasets. So it would make even more sense to have a big table including all the results. About data preprocessing, I believe that giving more information about which noisy data are removed would be nice.<br />
* Despite thorough comparison between each method against the proposed method, it does not give a reason to why it was the case that it was either better or worse, and it does not necessarily need to be a mathematical explanation but an intuitive one to demonstrate how it can be replicated and whether the results require a certain condition to achieve. <br />
* Though we have a graph demonstrating the training loss with Random-Softmax and Search-Softmax with regards to the number of Epochs as an independent variable which we may deduce the number of epochs used in later graphs but since one of the main features is that "Meanwhile, our optimization strategy enables that the dynamic loss can guide<br />
* Did the paper address why the average model performs worse on African faces, would it be a lack of data points?<br />
the model training of different epochs, which helps further boost the discrimination power." it is imperative that the results are comparable along the same scale (for example, for 20 epochs, then take the average of the losses).<br />
* The result summary is overwhelming with numbers and representation of result is lacking. It would be great if the result can be explained. Introduction of model and its component is lacking and could be explained more.<br />
* It would be better if the paper contains some Face Recognition visualization, i.e. show actually face recognition example to show the improvement.<br />
* The introduction of data and the analysis of data processing are important because there might be some limitations. Also, it would be better to give theoretical analysis of the effects of reducing softmax probability and the number of sampled models, which explains the update of the parameters for better performance.<br />
* It would be better to include time performance in the evaluation section.<br />
* The paper is missing details on datasets. It would be better to know if the datasets were balanced or unbalanced and how this would affect the accuracy. Also, computational comparisons between the new loss function versus traditional method would be interesting to know.<br />
* The paper included a dataset that measures racial bias, however it is a widely known fact that majority of face recognition models are trained on biased and imbalanced datasets themselves. For example, AI that has bias towards classifying a black person as a prisoner since the training set of prisoners is predominantly black. A question that remains unanswered is how training a model using the proposed loss function helps to combat racial bias in machine learning, and how these results in particular improved (or worsened) with its use.<br />
* There are too much data in the conclusion part. A brief conclusion based on several sentences should be enough to present the ideas.<br />
* The author could add the time efficiency of fave recognition in the result to compare the models with other current models for facial recognition since nowadays many application that uses face recognition rely on fast recognition(e.g. unlock phone with face id)<br />
*It is interesting to see how loss function impacts face recognition. It would be better to see different evaluations based on different datasets. Not only accuracy is important, but also efficiency is significant.<br />
* It is important for facial recognition models to feature some level of noise, since for most uses, things like lighting and camera quality is drastically different. For a more practical use of this, datasets with a certain level of noise should be introduced and tested.<br />
* The paper omits the full details of the necessity of normalization within the model; as described in the original [https://arxiv.org/abs/1801.05599 paper] describing A-softmax for face verification they stressed the importance of normalization. Does the lack of detail mean these new softmax loss functions don't require layer normalization?<br />
<br />
== References ==<br />
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.<br />
<br />
[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.<br />
2020].<br />
<br />
[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020]. <br />
<br />
[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].<br />
<br />
[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Loss_Function_Search_for_Face_Recognition&diff=49693Loss Function Search for Face Recognition2020-12-07T03:29:40Z<p>X277lu: /* Critiques */</p>
<hr />
<div>== Presented by ==<br />
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang<br />
<br />
== Introduction ==<br />
Face recognition is a technology that can label a face to a specific identity. The field of study involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face image and another face image map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. A discriminative feature is one that is able to successfully discriminate the labeled data, and is typically a result of feature engineering/selection. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness. Hence, the paper introduced a new loss function using a scale parameter to produce higher gradients to well-separated samples which can reduce the softmax probability. <br />
<br />
Margin-based (angular, additive, additive angular margins) soft-max loss functions are important in learning discriminative features in face recognition. There have been hand-crafted methods previously developed that require much efforts such as A-softmax, V-softmax, AM-Softmax, and Arc-softmax. Li et al. proposed an AutoML for loss function search method also known as AM-LFS from a hyper-parameter optimization perspective [2]. It automatically determines the search space by leveraging reinforcement learning to the search loss functions during the training process, though the drawback is the complex and unstable search space.<br />
<br />
'''Soft Max'''<br />
<br />
Softmax probability is the probability for each class. It contains a vector of values that add up to 1 while ranging between 0 and 1. Cross-entropy loss is the negative value of target values times the log of the probabilities. When softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:<br />
<br />
<center><math>L_1=-\log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k≠y}^K{e^{w^T_yx}}}</math> [1] </center><br />
<br />
<br />
Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:<br />
<br />
<center><math>L_2=-\log\frac{e^{s \cos{(\theta_{{w_y},x})}}}{e^{s \cos{(\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_y},x})}}}}</math> [1] </center><br />
<br />
Where <math> \cos{(\theta_{{w_k},x})} = w^T_y </math> is cosine similarity and <math>\theta_{{w_k},x}</math> is angle between <math> w_k</math> and x. The learnt features with this soft max loss are prone to be separable (as desired).<br />
<br />
'''Margin-based Softmax'''<br />
<br />
This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above.<br />
<br />
The margin-based softmax function is:<br />
<br />
<center><math>L_3=-\log\frac{e^{s f{(m,\theta_{{w_y},x})}}}{e^{s f{(m,\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_y},x})}}}} </math> </center><br />
<br />
Here, <math>f{(m,\theta_{{w_y},x})} \leq \cos (\theta_{w_y,x})</math> is a carefully chosen margin function.<br />
<br />
Some other variations of chosen functions:<br />
<br />
'''A-Softmax Loss:''' <math>f{(m_1,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x})</math> , where m1 >= 1 and a integer.<br />
<br />
'''Arc-Softmax Loss:'''<math>f{(m_1,\theta_{{w_y},x})} = \cos (\theta_{w_y,x} + m_2)</math>, where m2 > 0<br />
<br />
'''AM-Softmax Loss:'''<math>f{(m,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x} + m_2) - m_3</math>, where m1 >= 1 and a integer; m2,m3 > 0<br />
<br />
<br />
<br />
In this paper, the authors first identified that reducing the softmax probability is a key contribution to feature discrimination and designed two search spaces (random and reward-guided method). They then evaluated their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms using nine popular face recognition benchmarks.<br />
<br />
== Motivation ==<br />
Previous algorithms for facial recognition frequently rely on CNNs that may include metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost for these functions is high. This drawback prompts the redesign of classical softmax loss that cannot discriminate features. Multiple softmax loss functions have since been developed including margin-based formulations which often require fine-tuning of parameters and are susceptible to instability. Therefore, researchers need to put in a lot of effort in creating their method in the large design space. AM-LFS takes an optimization approach for selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.<br />
<br />
To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one required parameter and an improved search space using a reward-based method was determined by the authors to be the best option for their loss function.<br />
<br />
== Problem Formulation ==<br />
=== Analysis of Margin-based Softmax Loss ===<br />
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:<br />
<br />
<center><math>p_m=\frac{1}{ap+(1-a)}*p</math></center><br />
<center> where <math>a=1-e^{s\,{cos{(\theta_{w_y},x)}-f{(m,\theta_{w_y},x)}}}</math> and <math>a≤0</math></center><br />
<br />
<math>a</math> is considered as a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the margin function (<math>f</math>), the minimization of the softmax probability will ensure success.<br />
<br />
Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, versus AM-LFS which has 2M parameters without constraints that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because it could be larger than the softmax probability.<br />
<br />
=== Random Search ===<br />
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:<br />
<br />
<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center><br />
<br />
This encourages the feature margin between different classes and has the capability of feature discrimination. This leads to defining the search space as the choice of <math>h{(a,p)}</math> whose impacts on the training procedure are decided by the modulating factor <math>a</math>. In order to validate the unified formulation, a modulating factor is randomly set at each training epoch. This is noted as Random-Softmax in this paper.<br />
<br />
=== Reward-Guided Search ===<br />
Random search has no guidance for training. To solve this, the authors use reinforcement learning. Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need to have input/output labelled and it does not need a sub-optimal action to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking an action that maximizes the notion of cumulative reward [3]. The process of RL is shown in figure 1. The equation of the cumulative reward function is: <br />
<br />
<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center><br />
<br />
where <math>G_t</math> = cumulative reward, <math>R_t</math> = immediate reward, and <math>R_T</math> = end of episode.<br />
<br />
<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate reward which depends on the agent action and the environment's reaction to this action.<br />
<br />
<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center><br />
<center>Figure 1: Reinforcement Learning scenario [4]</center><br />
<br />
The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5]. <br />
<br />
In this paper, RL is being used to generate a distribution of the hyperparameter <math>\mu</math> for the SoftMax equation using the reward function. At each epoch, <math>B</math> hyper-parameters <math>{a_1, a_2, ..., a_B }</math> are sampled as <math>a \sim \mathcal{N}(\mu, \sigma)</math>. In each epoch, <math>B</math> models are generated with rewards <math>R(a_i), i \in [1, B]</math>. <math>\mu</math> updates after each epoch from the reward function. <br />
<br />
<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center><br />
<br />
Where <math>{g(a_i; \mu, \sigma})</math> is the PDF of a Gaussian distribution. The distributions of <math>{a}</math> are updated and the best model if found from the <math>{B}</math> candidates for the next epoch.<br />
<br />
=== Optimization ===<br />
Calculating the reward involves a standard bi-level optimization problem. A standard bi-level optimization problem is a hierarchy of two optimization tasks, an upper-level or leader and lower-level or follower problems, which involves a hyperparameter ({<math>a_1,a_2,…,a_B</math>}) that can be used for minimizing one objective function while maximizing another objective function simultaneously:<br />
<br />
<center><math>max_a R(a)=r(M_{w^*(a)},S_v)</math></center><br />
<center><math>w^*(a)=_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center><br />
<br />
In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The calculated reward for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) yields the corresponding score, then the algorithm chooses the one with the highest score for model index selection. With the model containing the highest score being used in the next epoch, this process is repeated until the training reaches convergence. In the end, the algorithm takes the model with the highest score without retraining.<br />
<br />
== Results and Discussion ==<br />
=== Data Preprocessing ===<br />
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets.<br />
Furthermore, it is important to perform open-set evaluation for face recognition problems. That is, there shall be no overlapping identities between training and testing sets. As a result, there were a total of 15,414 identities removed from the testing sets. For fairness during comparison, all summarized results will be based on refined datasets.<br />
<br />
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===<br />
For LFW, there is not a noticeable difference between the algorithms proposed in this paper and the other algorithms, however, AM-Softmax achieved higher results than Search-Softmax. Random-Softmax achieved the highest results by 0.03%.<br />
<br />
Random-Softmax outperforms baseline Soft-max and is comparable to most of the margin-based softmax. Search-Softmax boosts the performance and better most methods specifically when training CASIA-WebFace-R data set, it achieves 0.72% average improvement over AM-Softmax. The reason the model proposed by the paper gives better results is because of their optimization strategy which helps boost the discrimination power. Also, the sampled candidate from the paper’s proposed search space can well approximate the margin-based loss functions. More tests need to happen to more complicated protocols to test the performance further. Not a lot of improvement has been shown on those test sets, since they are relatively simple and the performance of all the methods on these test sets are near saturation. The following table gives a summary of the performance of each model.<br />
<br />
<center>Table 1.Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<br />
<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on RFW ===<br />
The RFW dataset measures racial bias which consists of Caucasian, Indian, Asian, and African. Using this as the test set, Random-softmax and Search-softmax performed better than the other methods. Random-softmax outperforms the baseline softmax by a large margin which means reducing the softmax probability will enhance the feature discrimination for face recognition. It is also observed that the reward guided search-softmax method is more likely to enhance the discriminative feature learning resulting in higher performance as shown in Table 2 and Table 3. <br />
<br />
<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on MegaFace and Trillion-Pairs ===<br />
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 and the verification (Veri.) with the true positive rate (TPR) at low false acceptance rate (FAR) at <math>1e^{-3}</math> on MegaFace, the identification TPR@FAR = <math>1e^{-6}</math> and the verification TPR@FAR = <math>1e^{-9}</math> on Trillion-Pairs are reported on Table 4 and 5.<br />
<br />
On the test sets MegaFace and Trillion-Pairs, Search-Softmax achieves the best performance over all other alternative methods. On MegaFace, Search-Softmax beat the best competitor AM-softmax by a large margin. It also outperformed AM-LFS due to newly designed search space. <br />
<br />
<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center><br />
<br />
From the CMC curves and ROC curves in Figure 2, similar trends are observed at other measures. There is a similar trend with Trillion-Pairs where Search-Softmax loss is found to be superior with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R at both the identification and verification. Based on these experiments, Search-Softmax loss can perform well, especially with a low false positive rate and it shows a strong generalization ability for face recognition.<br />
<br />
<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center><br />
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center><br />
<br />
== Conclusion ==<br />
The paper discussed that in order to enhance feature discrimination for face recognition, it is crucial to reduce the softmax probability. To achieve this goal, unified formulation for the margin-based softmax losses is designed. Two search methods have been developed using a random and a reward-guided loss function and they were validated to be effective over six other methods using nine different test data sets. While these developed methods were generally more effective in increasing accuracy versus previous methods, there is very little difference between the two. It can be seen that Search-Softmax performs slightly better than Random-Softmax most of the time.<br />
<br />
== Critiques ==<br />
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.<br />
* Datasets used did require some preprocessing, which may have improved the results beyond what the method otherwise would.<br />
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.<br />
* The test data set they used to test Search-Softmax and Random-Softmax are simple and they saturate in other methods. So the results of their methods didn’t show many advantages since they produce very similar results. A more complicated data set needs to be tested to prove the method's reliability.<br />
* There is another paper Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf] that provides a more detailed explanation about how to reduce margin-based softmax loss.<br />
* It is questionable when it comes to the accuracy of testing sets, as they only used the clean version of CASIA-WebFace and MS-Celeb-1M-vlc for training instead of these two training sets with noisy labels.<br />
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], written by Tae-Hyun Oh et al., they also discuss an optimal loss function for face recognition. However, since in the other paper, they were doing face recognition from voice audio, the loss function used was slightly different than the ones discussed in this paper.<br />
* This model has many applications such as identifying disguised prisoners for police. But we need to do a good data preprocessing otherwise we might not get a good predicted result. But authors did not mention about the data preprocessing which is a key part of this model. It might be done in either feature extraction or feature selection, but it's crucial to indicate how they did it.<br />
* It will be better if we can know what kind of noises was removed in the clean version. Also, simply removing the overlapping data is wasteful. It would be better to just put them into one of the train and test samples.<br />
* This paper indicate that the new searching method and loss function have induced more effective face recognition result than other six methods. But there is no mention of the increase or decrease in computational efficiency since only very little difference exist between those methods and the real time evaluation is often required at the face recognition application level.<br />
* There are some loss functions that receives more than 2 inputs. For example, the ''triplet loss'' function, developed by Google, takes 3 inputs: positive input, negative input and anchor input. This makes sense because for face recognition, we want to model to learn not only what it is supposed to predict but also what it is not supposed to predict. Typically, triplet loss handles false positives much better. This paper can extend its scope to such loss function that takes more than 2 inputs.<br />
* It would be good to also know what the training time is like for the method, specifically the "Reward-Guided Search" which uses RL. Also the authors mention some data preprocessing that was performed, was this same preprocessing also performed for the methods they compared against?<br />
* Sections on Data Processing and Results can be improved. About the datasets, I have some questions about why they are divided in the current fashion. It is mentioned that "CASIA-WebFace and MS-Celeb-1M-v1c" are used as training datasets. But the comparison of algorithms are divided into three groups: Megaface and TrillionPairs, RFW, and a group of other datasets. In general, when we are comparing algorithms, we want to have a holistic view of how each algorithm compare. So I have some concerns about dividing the results into three section. More explanation can be provided. It also seems like Random-Softmax and Search Softmax outperform all other algorithms across all datasets. So it would make even more sense to have a big table including all the results. About data preprocessing, I believe that giving more information about which noisy data are removed would be nice.<br />
* Despite thorough comparison between each method against the proposed method, it does not give a reason to why it was the case that it was either better or worse, and it does not necessarily need to be a mathematical explanation but an intuitive one to demonstrate how it can be replicated and whether the results require a certain condition to achieve. <br />
* Though we have a graph demonstrating the training loss with Random-Softmax and Search-Softmax with regards to the number of Epochs as an independent variable which we may deduce the number of epochs used in later graphs but since one of the main features is that "Meanwhile, our optimization strategy enables that the dynamic loss can guide<br />
* Did the paper address why the average model performs worse on African faces, would it be a lack of data points?<br />
the model training of different epochs, which helps further boost the discrimination power." it is imperative that the results are comparable along the same scale (for example, for 20 epochs, then take the average of the losses).<br />
* The result summary is overwhelming with numbers and representation of result is lacking. It would be great if the result can be explained. Introduction of model and its component is lacking and could be explained more.<br />
* It would be better if the paper contains some Face Recognition visualization, i.e. show actually face recognition example to show the improvement.<br />
* The introduction of data and the analysis of data processing are important because there might be some limitations. Also, it would be better to give theoretical analysis of the effects of reducing softmax probability and the number of sampled models, which explains the update of the parameters for better performance.<br />
* It would be better to include time performance in the evaluation section.<br />
* The paper is missing details on datasets. It would be better to know if the datasets were balanced or unbalanced and how this would affect the accuracy. Also, computational comparisons between the new loss function versus traditional method would be interesting to know.<br />
* The paper included a dataset that measures racial bias, however it is a widely known fact that majority of face recognition models are trained on biased and imbalanced datasets themselves. For example, AI that has bias towards classifying a black person as a prisoner since the training set of prisoners is predominantly black. A question that remains unanswered is how training a model using the proposed loss function helps to combat racial bias in machine learning, and how these results in particular improved (or worsened) with its use.<br />
* There are too much data in the conclusion part. A brief conclusion based on several sentences should be enough to present the ideas.<br />
* The author could add the time efficiency of fave recognition in the result to compare the models with other current models for facial recognition since nowadays many application that uses face recognition rely on fast recognition(e.g. unlock phone with face id)<br />
*It is interesting to see how loss function impacts face recognition. It would be better to see different evaluations based on different datasets. Not only accuracy is important, but also efficiency is significant.<br />
* It is important for facial recognition models to feature some level of noise, since for most uses, things like lighting and camera quality is drastically different. For a more practical use of this, datasets with a certain level of noise should be introduced and tested.<br />
* The paper omits the full details of the necessity of normalization within the model; as described in the original [https://arxiv.org/abs/1801.05599 paper] describing A-softmax for face verification they stressed the importance of normalization. Does the lack of detail mean these new softmax loss functions don't require layer normalization?<br />
<br />
== References ==<br />
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.<br />
<br />
[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.<br />
2020].<br />
<br />
[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020]. <br />
<br />
[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].<br />
<br />
[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Efficient_kNN_Classification_with_Different_Numbers_of_Nearest_Neighbors&diff=49686Efficient kNN Classification with Different Numbers of Nearest Neighbors2020-12-07T03:25:07Z<p>X277lu: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Cooper Brooke, Daniel Fagan, Maya Perelman<br />
<br />
== Introduction == <br />
Traditional model-based approaches for classification problem requires to train a model on training observations before predicting test samples. In contrast, the model-free k-Nearest Neighbors (KNNs) method classifies observations with a majority rule approach, labeling each piece of test data based on its k closest training observations (neighbors), meaning normalizing the training data can improve its accuracy a lot. This method has become very popular due to its relatively robust performance given how simple it is to implement. It is robust because the predicted value is only depend on the label of the closest data and that is not significantly affected by outliers.<br />
<br />
There are two main approaches to conduct kNN classification in respect of the choice for k. The first is to use a fixed k value to classify all test samples, while the second is to use a different k value each time, either for different k values for each test sample or different k values for each class. The former, while easy to implement, has shown to be impractical in real-world machine learning applications. It is more reasonable and practical to select a unique value of k for each test sample to allow for a better fit of the data. Therefore, it is of immense interest to develope an efficient way to determine the optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.<br />
<br />
== Previous Work and Motivation== <br />
<br />
The problem of finding an optimal fixed k value for all test samples is well-studied. Lall and Sharma [9] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al.[2] explored involves selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] had the solution of selecting a robust k using the consensus of multiple rounds of kNNs. These fixed k methods are valuable however are impractical for data mining and machine learning applications. <br />
<br />
Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahugara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross validation while Sahugara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches such as those taken by Zheng et al. and Góra and Wojna also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning an optimal-k-value for each test sample or scanning all training samples for finding nearest neighbors are time-consuming. It is challenging for simultaneously addressing these issues of kNN method including optimal-k-values learning for different samples, time cost reduction, and performance improvement.<br />
<br />
Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seek to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.<br />
<br />
A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.<br />
<br />
== Approach == <br />
<br />
<br />
=== kTree Classification ===<br />
<br />
The proposed kTree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_1.png | center | 800x800px]]<br />
<br />
==== Reconstruction ====<br />
<br />
The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}\in \mathbb{R}^{d\times n} = [x_1,...,x_n]</math> represents the training set which can be written as:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2<br />
\end{aligned}$$<br />
<br />
In addition, regularization term can be added to avoid the issue of singularity and increase the robustness of the reconstruction:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho||\textbf{W}||^2_2<br />
\end{aligned}$$<br />
<br />
<math>l_2</math> regularization is the most commonly used to reduce overfitting for regression, and it is called ridge regression which has a close-form solution where <math>W = (X^TX+\rho I)^{-1}X^TX</math>. However, this objective function with <math>l_2</math> regulairzation does not provide a sparse result. With the goal of increaseing computational efficieny, we follow the literature and adopt the <math>l_1</math> regularization instead:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||, \textbf{W}\geq 0<br />
\end{aligned}$$<br />
<br />
Generally, the larger the value of <math>\rho_1</math>, the more sparse the weight matrix <math>\textbf{W}</math> is.The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. It is penalized with the function: <br />
<br />
$$\frac{1}{2} \sum^{d}_{i,j} ||x^iW-x^jW||^2_2$$<br />
<br />
with sij denotes the relation between feature vectors. It uses a radial basis function kernel to calculate Sij. After some transformations, this second regularization term that has tuning parameter <math>\rho_2</math> is:<br />
<br />
$$\begin{aligned}<br />
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})<br />
\end{aligned}$$<br />
<br />
where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features. The Laplacian matrix, also called the graph Laplacian, is a matrix representation of a graph. <br />
<br />
This gives a final objective function of:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}|| + \rho_2R(\textbf{W})<br />
\end{aligned}$$<br />
<br />
Since this is a convex function, an iterative method can be used to find the optimal solution <math>\mathbf{W^*}</math>.<br />
<br />
==== Calculate ''k'' for training set ====<br />
<br />
Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training sample so if a value is 0, it can be concluded that the jth training sample has no effect on the ith training sample which means that it should not be used in the prediction of the ith training sample. Consequently, all non-zero values in the <math>w_{.j}</math> vector would be useful in predicting the ith training sample which gives the result that the number of these non-zero elements for each sample is equal to the optimal ''k'' value for each sample.<br />
<br />
For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:<br />
<br />
[[File:Approach_Figure_2.png | center | 300x300px]]<br />
<br />
The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 are non-zero.<br />
<br />
==== Train a Decision Tree using ''k'' as the label ====<br />
<br />
A decision tree is trained using the traditional ID3 method;<br />
(1) calculate the entropy of every feature in your data set,<br />
(2) split the data-set based on the feature whose entropy is minimized after splitting (in the example below, this was feature a'),<br />
(3) make a decision tree node based on that feature,<br />
(4) repeat steps (1)-(3) recursively on the formed subsets using the remaining features, <br />
replacing the label by the previously learned optimal ''k'' value for each sample. More specifically, whereas in a normal decision tree, the target data are the labels themselves, in the kTree method, the target data is the optimal ''k'' value for each sample that was solved for in the previous step. As a result, the decision tree formed by the kTree method has the following form:<br />
<br />
[[File:Approach_Figure_3.png | center | 300x300px]]<br />
<br />
==== Making Predictions for Test Data ====<br />
<br />
The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbors across '''all''' of the training data.<br />
<br />
=== k*Tree Classification ===<br />
<br />
The proposed k*Tree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_4.png | center | 1000x1000px]]<br />
<br />
Clearly, this is a very similar approach to the kTree as the k*Tree method attempts to sacrifice very little in predictive power in return for a substantial decrease in complexity when actually implementing the traditional kNN on the testing data once the optimal ''k'' values have been found.<br />
<br />
While all steps previous are the exact same, the difference comes from additional data stored in the leaf nodes. k*Tree method not only stores the optimal ''k'' value but also the following information:<br />
<br />
* The training samples that have the same optimal ''k''<br />
* The ''k'' nearest neighbours of the previously identified training samples<br />
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours<br />
<br />
The data stored in each node is summarized in the following figure:<br />
<br />
[[File:Approach_Figure_5.png | center | 800x800px]]<br />
<br />
When testing, the constructed k*Tree is searched for its optimal k values well as its nearest neighbours in the leaf node. It then selects a number of its nearest neighbours from the subset of training samples and assigns the test sample with the majority label of these nearest neighbours.<br />
<br />
In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.<br />
<br />
== Experiments == <br />
<br />
In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, and contained a mix of data types varying in size, in dimensionality, in the number of classes, and in imbalanced nature of the data. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:<br />
<br />
# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.<br />
# kNN-Based Applicability Domain Approach (AD-kNN) [11]<br />
# kNN Method Based on Sparse Learning (S-kNN) [10]<br />
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]<br />
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]<br />
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]<br />
<br />
The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features. <br />
<br />
<br />
'''A. Experimental Results on Different Sample Sizes'''<br />
<br />
The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.<br />
<br />
[[File:Table_I_kNN.png | center | 1000x1000px]]<br />
<br />
The following key results are noted:<br />
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-KNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.<br />
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.<br />
* The proposed k*Tree method had the lowest running cost of all methods. However, the k*Tree method was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method.<br />
<br />
<br />
'''B. Experimental Results on Different Feature Numbers'''<br />
<br />
The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score, an algorithm that solves maximum likelihood equations numerically [15], was used to rank and select the most information features in the datasets. <br />
<br />
[[File:Table_II_kNN.png | center | 1000x1000px]]<br />
<br />
From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-KNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.<br />
<br />
== Conclusion == <br />
<br />
This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods can classify the test samples efficiently and effectively with a designed training step that reduces the run time of the test stage and thus enhances the performance. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. More future areas of investigation could focus on the improvement of kTree and k*Tree for high-dimensional data.<br />
<br />
== Critiques == <br />
<br />
*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy. In addition, the differences between different methods in classification error vary only by a small amount. <br />
<br />
* The authors addressed that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.). <br />
*While the authors contrast their kTree and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compared to Góra and Wojna varied optimal k method.<br />
<br />
* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.<br />
<br />
* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.<br />
<br />
* It would be nice to have a comparison of the running costs of different methods to see how much faster kTree and k*Tree performed<br />
<br />
* It would be better to show the key result only on a summary rather than stacking up all results without screening.<br />
<br />
* In the results section, it was mentioned that in the experiment on data sets with different numbers of features, the kTree and k*Tree model did not achieve GS-kNN or S-kNN's accuracies, but was faster in terms of running cost. It might be helpful here if the authors add some more supporting arguments about the benefit of this tradeoff, which appears to be a minor decrease in accuracy for a large improvement in speed. This could further showcase the advantages of the kTree and k*Tree models. More quantitative analysis or real-life scenario examples could be some choices here.<br />
<br />
* An interesting thing to notice while solving for the optimal matrix <math>W^*</math> that minimizes the loss function is that <math>W^*</math> is not necessarily a symmetric matrix. That is, the correlation between the <math>i^{th}</math> entry and the <math>j^{th}</math> entry is different from that between the <math>j^{th}</math> entry and the <math>i^{th}</math> entry, which makes the resulting W* not really semantically meaningful. Therefore, it would be interesting if we may set a threshold on the allowing difference between the <math>ij^{th}</math> entry and the <math>ji^{th}</math> entry in <math>W^*</math> and see if this new configuration will give better or worse results compared to current ones, which will provide better insights of the algorithm.<br />
<br />
* It would be interesting to see how the proposed model works with highly non-linear datasets. In the event it does not work well, it would pose the question: would replacing the k*Tree with a SVM or a neural network improve the accuracy? There could be experiments to show if this variant would prove superior over the original models.<br />
<br />
* The key results are a little misleading - for example they claim "the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method" is false. The kTree method had slightly lower accuracy than both GS-kNN and S-kNN and kTree was also slower than LC-kNN<br />
<br />
* I want to point to the discussion on k*Tree's structure. In order for k*Tree to work effectively, its leaf nodes needs to store additional information. In addition to the optimal k value, it also needs to store things like the training samples that have the optimal k, and the k nearest neighbours of the previously identified training samples. How big of am impact does this structure have on storage cost? Since the number of leaf nodes can be large, the storage cost may be large as well. This can potentially make k*tree ineffective to use in practice, especially for very large datasets.<br />
<br />
* It would be better if the author can explain more on KTree method and the similarity of KTree method and KNN method.<br />
<br />
* Even though we are given a table with averages on the accuracy and mean running cost, it would have been nice to see a direct visual comparison in the figures followed below. In addition to comparing to other algorithms, it would be helpful to see the average expected cost of these algorithms to show as control or rather a standard to accuracy and compute cost to assess the overall general expected cost of running such classification algorithm to fully assess its efficacy.<br />
<br />
* It doesn't clearly mention what's the definition/similarity/difference between Ktree and KNN methods. If the authors could put some detailed explanations in the beginning, the flow of this paper would have been much better.<br />
<br />
* It would be better to know if the paper indicates the performance difference between small and large dataset. Would the performance increase be negligible in small features datasets?<br />
<br />
* It would be more clear if the experiment connect with the approach part tightly, like even just mention how to apply the approach to get these results.<br />
<br />
* It would be better if the author had provided several paragraphs discussing the complexity of these models. It seems like the highlight of kTree is that it offers similar performance at a significantly lower cost.<br />
<br />
* The author should compare the complexity of several different algorithms that gets the optimum k value, and discuss the pros and cons of each approach<br />
<br />
* It might be difficult to say k Tree classification is better than kNN. It might require a larger dataset and perform hyperparameter tuning on both methods.<br />
<br />
== References == <br />
<br />
[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.<br />
<br />
[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.<br />
<br />
[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.<br />
<br />
[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.<br />
<br />
[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.<br />
<br />
[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.<br />
<br />
[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.<br />
<br />
[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.<br />
<br />
[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.<br />
<br />
[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.<br />
<br />
[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013. <br />
<br />
[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.<br />
<br />
[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.<br />
<br />
[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.<br />
<br />
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Efficient_kNN_Classification_with_Different_Numbers_of_Nearest_Neighbors&diff=49682Efficient kNN Classification with Different Numbers of Nearest Neighbors2020-12-07T03:24:18Z<p>X277lu: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Cooper Brooke, Daniel Fagan, Maya Perelman<br />
<br />
== Introduction == <br />
Traditional model-based approaches for classification problem requires to train a model on training observations before predicting test samples. In contrast, the model-free k-Nearest Neighbors (KNNs) method classifies observations with a majority rule approach, labeling each piece of test data based on its k closest training observations (neighbors), meaning normalizing the training data can improve its accuracy a lot. This method has become very popular due to its relatively robust performance given how simple it is to implement. It is robust because the predicted value is only depend on the label of the closest data and that is not significantly affected by outliers.<br />
<br />
There are two main approaches to conduct kNN classification in respect of the choice for k. The first is to use a fixed k value to classify all test samples, while the second is to use a different k value each time, either for different k values for each test sample or different k values for each class. The former, while easy to implement, has shown to be impractical in real-world machine learning applications. It is more reasonable and practical to select a unique value of k for each test sample to allow for a better fit of the data. Therefore, it is of immense interest to develope an efficient way to determine the optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.<br />
<br />
== Previous Work and Motivation== <br />
<br />
The problem of finding an optimal fixed k value for all test samples is well-studied. Lall and Sharma [9] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al.[2] explored involves selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] had the solution of selecting a robust k using the consensus of multiple rounds of kNNs. These fixed k methods are valuable however are impractical for data mining and machine learning applications. <br />
<br />
Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahugara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross validation while Sahugara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches such as those taken by Zheng et al. and Góra and Wojna also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning an optimal-k-value for each test sample or scanning all training samples for finding nearest neighbors are time-consuming. It is challenging for simultaneously addressing these issues of kNN method including optimal-k-values learning for different samples, time cost reduction, and performance improvement.<br />
<br />
Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seek to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.<br />
<br />
A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.<br />
<br />
== Approach == <br />
<br />
<br />
=== kTree Classification ===<br />
<br />
The proposed kTree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_1.png | center | 800x800px]]<br />
<br />
==== Reconstruction ====<br />
<br />
The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}\in \mathbb{R}^{d\times n} = [x_1,...,x_n]</math> represents the training set which can be written as:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2<br />
\end{aligned}$$<br />
<br />
In addition, regularization term can be added to avoid the issue of singularity and increase the robustness of the reconstruction:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho||\textbf{W}||^2_2<br />
\end{aligned}$$<br />
<br />
<math>l_2</math> regularization is the most commonly used to reduce overfitting for regression, and it is called ridge regression which has a close-form solution where <math>W = (X^TX+\rho I)^{-1}X^TX</math>. However, this objective function with <math>l_2</math> regulairzation does not provide a sparse result. With the goal of increaseing computational efficieny, we follow the literature and adopt the <math>l_1</math> regularization instead:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||, \textbf{W}\geq 0<br />
\end{aligned}$$<br />
<br />
Generally, the larger the value of <math>\rho_1</math>, the more sparse the weight matrix <math>\textbf{W}</math> is.The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. It is penalized with the function: <br />
<br />
$$\frac{1}{2} \sum^{d}_{i,j} ||x^iW-x^jW||^2_2$$<br />
<br />
with sij denotes the relation between feature vectors. It uses a radial basis function kernel to calculate Sij. After some transformations, this second regularization term that has tuning parameter <math>\rho_2</math> is:<br />
<br />
$$\begin{aligned}<br />
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})<br />
\end{aligned}$$<br />
<br />
where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features. The Laplacian matrix, also called the graph Laplacian, is a matrix representation of a graph. <br />
<br />
This gives a final objective function of:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}|| + \rho_2R(\textbf{W})<br />
\end{aligned}$$<br />
<br />
Since this is a convex function, an iterative method can be used to find the optimal solution <math>\mathbf{W^*}</math>.<br />
<br />
==== Calculate ''k'' for training set ====<br />
<br />
Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training sample so if a value is 0, it can be concluded that the jth training sample has no effect on the ith training sample which means that it should not be used in the prediction of the ith training sample. Consequently, all non-zero values in the <math>w_{.j}</math> vector would be useful in predicting the ith training sample which gives the result that the number of these non-zero elements for each sample is equal to the optimal ''k'' value for each sample.<br />
<br />
For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:<br />
<br />
[[File:Approach_Figure_2.png | center | 300x300px]]<br />
<br />
The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 are non-zero.<br />
<br />
==== Train a Decision Tree using ''k'' as the label ====<br />
<br />
A decision tree is trained using the traditional ID3 method;<br />
(1) calculate the entropy of every feature in your data set,<br />
(2) split the data-set based on the feature whose entropy is minimized after splitting (in the example below, this was feature a'),<br />
(3) make a decision tree node based on that feature,<br />
(4) repeat steps (1)-(3) recursively on the formed subsets using the remaining features, <br />
replacing the label by the previously learned optimal ''k'' value for each sample. More specifically, whereas in a normal decision tree, the target data are the labels themselves, in the kTree method, the target data is the optimal ''k'' value for each sample that was solved for in the previous step. As a result, the decision tree formed by the kTree method has the following form:<br />
<br />
[[File:Approach_Figure_3.png | center | 300x300px]]<br />
<br />
==== Making Predictions for Test Data ====<br />
<br />
The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbors across '''all''' of the training data.<br />
<br />
=== k*Tree Classification ===<br />
<br />
The proposed k*Tree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_4.png | center | 1000x1000px]]<br />
<br />
Clearly, this is a very similar approach to the kTree as the k*Tree method attempts to sacrifice very little in predictive power in return for a substantial decrease in complexity when actually implementing the traditional kNN on the testing data once the optimal ''k'' values have been found.<br />
<br />
While all steps previous are the exact same, the difference comes from additional data stored in the leaf nodes. k*Tree method not only stores the optimal ''k'' value but also the following information:<br />
<br />
* The training samples that have the same optimal ''k''<br />
* The ''k'' nearest neighbours of the previously identified training samples<br />
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours<br />
<br />
The data stored in each node is summarized in the following figure:<br />
<br />
[[File:Approach_Figure_5.png | center | 800x800px]]<br />
<br />
When testing, the constructed k*Tree is searched for its optimal k values well as its nearest neighbours in the leaf node. It then selects a number of its nearest neighbours from the subset of training samples and assigns the test sample with the majority label of these nearest neighbours.<br />
<br />
In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.<br />
<br />
== Experiments == <br />
<br />
In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, and contained a mix of data types varying in size, in dimensionality, in the number of classes, and in imbalanced nature of the data. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:<br />
<br />
# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.<br />
# kNN-Based Applicability Domain Approach (AD-kNN) [11]<br />
# kNN Method Based on Sparse Learning (S-kNN) [10]<br />
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]<br />
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]<br />
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]<br />
<br />
The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features. <br />
<br />
<br />
'''A. Experimental Results on Different Sample Sizes'''<br />
<br />
The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.<br />
<br />
[[File:Table_I_kNN.png | center | 1000x1000px]]<br />
<br />
The following key results are noted:<br />
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-KNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.<br />
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.<br />
* The proposed k*Tree method had the lowest running cost of all methods. However, the k*Tree method was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method.<br />
<br />
<br />
'''B. Experimental Results on Different Feature Numbers'''<br />
<br />
The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score, an algorithm that solves maximum likelihood equations numerically [15], was used to rank and select the most information features in the datasets. <br />
<br />
[[File:Table_II_kNN.png | center | 1000x1000px]]<br />
<br />
From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-KNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.<br />
<br />
== Conclusion == <br />
<br />
This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods can classify the test samples efficiently and effectively with designing a training step that reduces the run time of the test stage and thus enhances the performance. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. More future areas of investigation could focus on the improvement of kTree and k*Tree for high-dimensional data.<br />
<br />
== Critiques == <br />
<br />
*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy. In addition, the differences between different methods in classification error vary only by a small amount. <br />
<br />
* The authors addressed that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.). <br />
*While the authors contrast their kTree and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compared to Góra and Wojna varied optimal k method.<br />
<br />
* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.<br />
<br />
* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.<br />
<br />
* It would be nice to have a comparison of the running costs of different methods to see how much faster kTree and k*Tree performed<br />
<br />
* It would be better to show the key result only on a summary rather than stacking up all results without screening.<br />
<br />
* In the results section, it was mentioned that in the experiment on data sets with different numbers of features, the kTree and k*Tree model did not achieve GS-kNN or S-kNN's accuracies, but was faster in terms of running cost. It might be helpful here if the authors add some more supporting arguments about the benefit of this tradeoff, which appears to be a minor decrease in accuracy for a large improvement in speed. This could further showcase the advantages of the kTree and k*Tree models. More quantitative analysis or real-life scenario examples could be some choices here.<br />
<br />
* An interesting thing to notice while solving for the optimal matrix <math>W^*</math> that minimizes the loss function is that <math>W^*</math> is not necessarily a symmetric matrix. That is, the correlation between the <math>i^{th}</math> entry and the <math>j^{th}</math> entry is different from that between the <math>j^{th}</math> entry and the <math>i^{th}</math> entry, which makes the resulting W* not really semantically meaningful. Therefore, it would be interesting if we may set a threshold on the allowing difference between the <math>ij^{th}</math> entry and the <math>ji^{th}</math> entry in <math>W^*</math> and see if this new configuration will give better or worse results compared to current ones, which will provide better insights of the algorithm.<br />
<br />
* It would be interesting to see how the proposed model works with highly non-linear datasets. In the event it does not work well, it would pose the question: would replacing the k*Tree with a SVM or a neural network improve the accuracy? There could be experiments to show if this variant would prove superior over the original models.<br />
<br />
* The key results are a little misleading - for example they claim "the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method" is false. The kTree method had slightly lower accuracy than both GS-kNN and S-kNN and kTree was also slower than LC-kNN<br />
<br />
* I want to point to the discussion on k*Tree's structure. In order for k*Tree to work effectively, its leaf nodes needs to store additional information. In addition to the optimal k value, it also needs to store things like the training samples that have the optimal k, and the k nearest neighbours of the previously identified training samples. How big of am impact does this structure have on storage cost? Since the number of leaf nodes can be large, the storage cost may be large as well. This can potentially make k*tree ineffective to use in practice, especially for very large datasets.<br />
<br />
* It would be better if the author can explain more on KTree method and the similarity of KTree method and KNN method.<br />
<br />
* Even though we are given a table with averages on the accuracy and mean running cost, it would have been nice to see a direct visual comparison in the figures followed below. In addition to comparing to other algorithms, it would be helpful to see the average expected cost of these algorithms to show as control or rather a standard to accuracy and compute cost to assess the overall general expected cost of running such classification algorithm to fully assess its efficacy.<br />
<br />
* It doesn't clearly mention what's the definition/similarity/difference between Ktree and KNN methods. If the authors could put some detailed explanations in the beginning, the flow of this paper would have been much better.<br />
<br />
* It would be better to know if the paper indicates the performance difference between small and large dataset. Would the performance increase be negligible in small features datasets?<br />
<br />
* It would be more clear if the experiment connect with the approach part tightly, like even just mention how to apply the approach to get these results.<br />
<br />
* It would be better if the author had provided several paragraphs discussing the complexity of these models. It seems like the highlight of kTree is that it offers similar performance at a significantly lower cost.<br />
<br />
* The author should compare the complexity of several different algorithms that gets the optimum k value, and discuss the pros and cons of each approach<br />
<br />
* It might be difficult to say k Tree classification is better than kNN. It might require a larger dataset and perform hyperparameter tuning on both methods.<br />
<br />
== References == <br />
<br />
[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.<br />
<br />
[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.<br />
<br />
[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.<br />
<br />
[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.<br />
<br />
[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.<br />
<br />
[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.<br />
<br />
[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.<br />
<br />
[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.<br />
<br />
[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.<br />
<br />
[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.<br />
<br />
[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013. <br />
<br />
[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.<br />
<br />
[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.<br />
<br />
[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.<br />
<br />
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Music_Recommender_System_Based_using_CRNN&diff=49677Music Recommender System Based using CRNN2020-12-07T03:19:27Z<p>X277lu: /* Accuracy Metrics */</p>
<hr />
<div>==Introduction and Objective:==<br />
<br />
In the digital era of music streaming, companies, such as Spotify and Pandora, are faced with the following challenge: can they provide users with relevant and personalized music recommendations amidst the ever-growing abundance of music and user data?<br />
<br />
The objective of this paper is to implement a personalized music recommender system that takes user listening history as input and continually finds new music that captures individual user preferences.<br />
<br />
This paper argues that a music recommendation system should vary from the general recommendation system used in practice since it should combine music feature recognition and audio processing technologies to extract music features, and combine them with data on user preferences.<br />
<br />
The authors of this paper took a content-based music approach to build the recommendation system - specifically, comparing the similarity of features based on the audio signal.<br />
<br />
The following two-method approach for building the recommendation system was followed:<br />
#Make recommendations including genre information extracted from classification algorithms.<br />
#Make recommendations without genre information.<br />
<br />
The authors used convolutional recurrent neural networks (CRNN), which is a combination of 2D convolutional neural networks (CNN) and recurrent neural network(RNN), as their main classification model.<br />
<br />
==Methods and Techniques:==<br />
Generally, a music recommender can be divided into three main parts: (i) users, (ii) items, and (iii) user-item matching algorithms. Firstly, a model for a user's music taste is generated based on their profiles. Secondly, item profiling based on editorial, cultural, and acoustic metadata is exploited to increase listener satisfaction. Thirdly, a matching algorithm is employed to recommend personalized music to the listener. Two main approaches are currently available;<br />
<br />
1. Collaborative filtering<br />
<br />
It is based on users' historical listening data and depends on user ratings. Nearest neighbour is the standard method used for collaborative filtering and can be broken into two classes of methods: (i) user-based neighbourhood methods and (ii) item-based neighbourhood methods. <br />
<br />
User-based neighbourhood methods calculate the similarity between the target user and other users, and selects the k most similar. A weighted average of the most similar users' song ratings is then computed to predict how the target user would rate those songs. Songs that have a high predicted rating are then recommended to the user. In contrast, methods that use item-based neighbourhoods calculate similarities between songs that the target user has rated well and songs they have not listened to in order to recommend songs.<br />
<br />
That being said, collaborative filtering faces many challenges. For example, given that each user sees only a small portion of all music libraries, sparsity and scalability become an issue. However, this can be dealt with using matrix factorization. A more difficult challenge to overcome is the fact that users often don't rate songs when they are listening to music. <br />
<br />
2. Content-based filtering<br />
<br />
Content based recommendation systems base their recommendations on the similarity of an items features and features that the user has enjoyed. It has two-steps; (i) Extract audio content features and (ii) predict user preferences.<br />
<br />
However content-based filtering has to overcome the challenge of only being able to predict based on users' existing interests. The model is unable to effectively scale to a user's ever changing music taste.<br />
<br />
In this work, the authors take a content-based approach, as they compare the similarity of audio signal features to make recommendations. To classify music, the original music’s audio signal is converted into a spectrogram image. Using the image and the Short Time Fourier Transform (STFT), we convert the data into the Mel scale which is used in the CNN and CRNN models. <br />
=== Mel Scale: === <br />
The scale of pitches that are heard by listeners, which translates to equal pitch increments.<br />
<br />
[[File:Mel.png|frame|none|Mel Scale on Spectrogram]]<br />
<br />
=== Short Time Fourier Transform (STFT): ===<br />
The transformation that determines the sinusoidal frequency of the audio, with a Hanning smoothing function. In the continuous case this is written as: <math>\mathbf{STFT}\{x(t)\}(\tau,\omega) \equiv X(\tau, \omega) = \int_{-\infty}^{\infty} x(t) w(t-\tau) e^{-i \omega t} \, d t </math><br />
<br />
where: <math>w(\tau)</math> is the Hanning smoothing function. The STFT is applied over a specified window length at a certain time allowing the frequency to represented for that given window rather than the entire signal as a typical Fourier Transform would.<br />
<br />
=== Convolutional Neural Network (CNN): ===<br />
A Convolutional Neural Network is a Neural Network that uses convolution in place of matrix multiplication for some layer calculations. By training the data, weights for inputs are updated to find the most significant data relevant to classification. These convolutional layers gather small groups of data with kernels and try to find patterns that can help find features in the overall data. The features are then used for classification. Padding is another technique used to extend the pixels on the edge of the original image to allow the kernel to more accurately capture the borderline pixels. Padding is also used if one wishes the convolved output image to have a certain size. The image on the left represents the mathematical expression of a convolution operation, while the right image demonstrates an application of a kernel on the data.<br />
<br />
[[File:Convolution.png|thumb|400px|left|Convolution Operation]]<br />
[[File:PaddingKernels.png|thumb|400px|center|Example of Padding (white 0s) and Kernels (blue square)]]<br />
<br />
=== Convolutional Recurrent Neural Network (CRNN): === <br />
The CRNN is similar to the architecture of a CNN, but with the addition of a Gated Recurrent Unit (GRU), which is a Recurrent Neural Network (RNN). An RNN is used to treat sequential data, by reusing the activation function of previous nodes to update the output. The GRU is used to store more long-term memory and will help train the early hidden layers. GRUs can be thought of as LSTMs but with a forget gate, and has fewer parameters than an LSTM. These gates are used to determine how much information from the past should be passed along onto the future. They are originally aimed to prevent the vanishing gradient problem, since deeper networks will result in smaller and smaller gradients at each layer. The GRU can choose to copy over all the information in the past, thus eliminating the risk of vanishing gradients.<br />
<br />
[[File:GRU441.png|thumb|400px|left|Gated Recurrent Unit (GRU)]]<br />
[[File:Recurrent441.png|thumb|400px|center|Diagram of General Recurrent Neural Network]]<br />
<br />
==Data Screening:==<br />
<br />
The authors of this paper used a publicly available music dataset made up of 25,000 30-second songs from the Free Music Archives which contains 16 different genres. The data is cleaned up by removing low audio quality songs, wrongly labelled genres and those that have multiple genres. To ensure a balanced dataset, only 1000 songs each from the genres of classical, electronic, folk, hip-hop, instrumental, jazz and rock were used in the final model. <br />
<br />
[[File:Data441.png|thumb|200px|none|Data sorted by music genre]]<br />
<br />
==Implementation:==<br />
<br />
=== Modeling Neural Networks ===<br />
<br />
As noted previously, both CNNs and CRNNs were used to model the data. The advantage of CRNNs is that they are able to model time sequence patterns in addition to frequency features from the spectrogram, allowing for greater identification of important features. Furthermore, feature vectors produced before the classification stage could be used to improve accuracy. <br />
<br />
In implementing the neural networks, the Mel-spectrogram data was split up into training, validation, and test sets at a ratio of 8:1:1 respectively and labelled via one-hot encoding. This made it possible for the categorical data to be labelled correctly for binary classification. As opposed to classical stochastic gradient descent, the authors opted to use binary classier and ADAM optimization to update weights in the training phase, and parameters of <math>\alpha = 0.001, \beta_1 = 0.9, \beta_2 = 0.999</math>. Binary cross-entropy was used as the loss function. <br />
Input spectrogram image are 96x1366. In both the CNN and CRNN models, the data was trained over 100 epochs with a batch size of 50 (limited computing power) and using binary cross-entropy as the loss function. Notable model specific details are below:<br />
<br />
'''CNN'''<br />
* Five convolutional layers with 3x3 kernel, stride 1, padding, batch normalization, and ReLU activation<br />
* Max pooling layers <br />
* The sigmoid function was used as the output layer<br />
<br />
'''CRNN'''<br />
* Four convolutional layers with 3x3 kernel (which construct a 2D temporal pattern - two layers of RNNs with Gated Recurrent Units), stride 1, padding, batch normalization, ReLU activation, and dropout rate 0.1<br />
* Feature maps are N x1x15 (N = number of features maps, 68 feature maps in this case) is used for RNNs.<br />
* 4 Max pooling layers for four convolutional layers with kernel ((2x2)-(3x3)-(4x4)-(4x4)) and same stride<br />
* The sigmoid function was used as the output layer<br />
<br />
The CNN and CRNN architecture is also given in the charts below.<br />
<br />
[[File:CNN441.png|thumb|800px|none|Implementation of CNN Model]]<br />
[[File:CRNN441.png|thumb|800px|none|Implementation of CRNN Model]]<br />
<br />
=== Music Recommendation System ===<br />
<br />
The recommendation system utilizes cosine similarity of the extracted features from the neural network to compute similarity. Each genre will have a song act as a centre point for each class. The final inputs of the trained neural networks will be the feature variables. The feature variables will be used in the cosine similarity to find the best recommendations. <br />
<br />
The values are between [-1,1], where larger values are songs that have similar features. When the user inputs five songs, those songs become the new inputs in the neural networks and the features are used by the cosine similarity with other music. The largest five cosine similarities are used as recommendations.<br />
[[File:Cosine441.png|frame|100px|none|Cosine Similarity]]<br />
<br />
== Evaluation Metrics ==<br />
=== Precision: ===<br />
* The proportion of True Positives with respect to the '''predicted''' positive cases (true positives and false positives)<br />
* For example, out of all the songs that the classifier '''predicted''' as Classical, how many are actually Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among those predicted to be of that certain genre<br />
<br />
=== Recall: ===<br />
* The proportion of True Positives with respect to the '''actual''' positive cases (true positives and false negatives)<br />
* For example, out of all the songs that are '''actually''' Classical, how many are correctly predicted to be Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among the correct instances of that genre<br />
<br />
=== F1-Score: ===<br />
An accuracy metric that combines the classifier’s precision and recall scores by taking the harmonic mean between the two metrics:<br />
<br />
[[File:F1441.png|frame|100px|none|F1-Score]]<br />
<br />
=== Receiver operating characteristics (ROC): ===<br />
* A graphical metric that is used to assess a classification model at different classification thresholds <br />
* In the case of a classification threshold of 0.5, this means that if <math>P(Y = k | X = x) > 0.5</math> then we classify this instance as class k<br />
* Plots the true positive rate versus false positive rate as the classification threshold is varied<br />
<br />
[[File:ROCGraph.jpg|thumb|400px|none|ROC Graph. Comparison of True Positive Rate and False Positive Rate]]<br />
<br />
=== Area Under the Curve (AUC) ===<br />
AUC is the area under the ROC in doing so, the ROC provides an aggregate measure across all possible classification thresholds.<br />
<br />
In the context of the paper: When scoring all songs as <math>Prob(Classical | X=x)</math>, it is the probability that the model ranks a random Classical song at a higher probability than a random non-Classical song.<br />
<br />
[[File:AUCGraph.jpg|thumb|400px|none|Area under the ROC curve.]]<br />
<br />
== Results ==<br />
=== Accuracy Metrics ===<br />
The table below is the accuracy metrics with the classification threshold of 0.5.<br />
<br />
[[File:TruePositiveChart.jpg|thumb|none|True Positive / False Positive Chart]]<br />
On average, CRNN outperforms CNN in true positive and false positive cases. In addition, it is very apparent that false positive rate is much more higher for songs in the Instrumental genre, perhaps indicating that more pre-processing needs to be done for songs in this genre or that it should be excluded from the analysis completely since most music incorporates instrumental components.<br />
<br />
<br />
[[File:F1Chart441.jpg|thumb|400px|none|F1 Chart]]<br />
On average, CRNN outperforms CNN in F1-score. <br />
<br />
<br />
[[File:AUCChart.jpg|thumb|400px|none|AUC Chart]]<br />
On average, CRNN also outperforms CNN in AUC metric.<br />
<br />
<br />
CRNN models that consider the frequency features and time sequence patterns of songs have a better classification performance through metrics such as F1 score and AUC compared to the CNN classifier.<br />
<br />
=== Evaluation of Music Recommendation System: ===<br />
<br />
* A listening experiment was performed with 30 participants to assess user responses to given music recommendations.<br />
* Participants choose 5 pieces of music they enjoy and the recommender system generates 5 new recommendations. The participants then evaluate the recommendation by recording whether they liked or disliked the music recommendation<br />
* The recommendation system takes two approaches to the recommendation:<br />
** Method one uses only the value of cosine similarity.<br />
** Method two uses the value of cosine similarity and information on music genre.<br />
*Perform test of significance of differences in average user likes between the two methods using a t-statistic:<br />
[[File:H0441.png|frame|100px|none|Hypothesis test between method 1 and method 2]]<br />
<br />
Comparing the two methods, <math> H_0: u_1 - u_2 = 0</math>, we have <math> t_{stat} = -4.743 < -2.037 </math>, which demonstrates that the increase in average user likes with the addition of music genre information is statistically significant.<br />
<br />
== Conclusion: ==<br />
<br />
The two two main conclusions obtained from this paper:<br />
<br />
* The music genre should be a key feature to increase the predictive capabilities of the music recommendation system.<br />
<br />
* To extract the song genre from a song’s audio signals and get overall better performance, CRNN’s are superior to CNN’s as they consider frequency in features and time sequence patterns of audio signals. <br />
<br />
According to the paper, the authors suggested adding other music features like tempo gram for capturing local tempo as a way to improve the accuracy of the recommender system.<br />
<br />
== Critiques/ Insights: ==<br />
# It would be helpful if authors bench-mark their novel approach with other recommendation algorithms such as collaborative filtering to see if there is a lift in predictive capabilities.<br />
# The listening experiment used to evaluate the recommendation system only includes songs that are outputted by the model. Users may be biased if they believe all songs have come from a recommendation system. To remove bias, we suggest having 15 songs where 5 songs are recommended and 10 songs are set. With this in the user’s mind, it may remove some bias in response and give more accurate predictive capabilities. <br />
# It would be better if they go into more details about how CRNN makes it perform better than CNN, in terms of attributes of each network. Also, it might be helpful to include the motivations why CNN and CRNN are useful in this specific problem. <br />
# The methodology introduced in this paper is probably also suitable for movie recommendations. As music is presented as spectrograms (images) in a time sequence, and it is very similar to a movie. <br />
# The way of evaluation is a very interesting approach. Since it's usually not easy to evaluate the testing result when it's subjective. By listing all these evaluations' performance, the result would be more comprehensive. A practice that might reduce bias is by coming back to the participants after a couple of days and asking whether they liked the music that was recommended. Often times music "grows" on people and their opinion of a new song may change after some time has passed. <br />
# The paper lacks the comparison between the proposed algorithm and the music recommendation algorithms being used now. It will be clearer to show the superiority of this algorithm.<br />
# The GAN neural network has been proposed to enhance the performance of the neural network, so an improved result may appear after considering using GAN.<br />
# The limitation of CNN and CRNN could be that they are only able to process the spectrograms with single labels rather than multiple labels. This is far from enough for the music recommender systems in today's music industry since the edges between various genres are blurred.<br />
# Is it possible for CNN and CRNN to identify different songs? The model would be harder to train, based on my experience, the efficiency of CNN in R is not very high, which can be improved for future work.<br />
# According to the author, the recommender system is done by calculating the cosine similarity of extraction features from one music to another music. Is possible to represent it by Euclidean distance or p-norm distances?<br />
# In real-life application, most of the music software will have the ability to recommend music to the listener and ask do they like the music that was recommended. It would be a nice application by involving some new information from the listener.<br />
# Actual music listeners do not listen to one genre of music, and in fact listening to the same track or the same genre would be somewhat unusual. Could this method be used to make recommendations not on genre, but based on other categories? (Such as the theme of the lyrics, the pitch of the singer, or the date published). Would this model be able to differentiate between tracks of varying "lyric vocabulation difficulty"? Or would NLP algorithms be needed to consider lyrics?<br />
# This model can be applied to many other fields such as recommending the news in the news app, recommending things to buy in the amazon, recommending videos to watch in YOUTUBE and so on based on the user information.<br />
# Looks like for the most genres, CRNN outperforms CNN, but CNN did do better on a few genres (like Jazz), so it might be better to mix them together or might use CNN for some genres and CRNN for the rest.<br />
# Cosine similarity is used to find songs with similar patterns as the input ones from users. That is, feature variables are extracted from the trained neural network model before the classification layer, and used as the basis to find similar songs. One potential problem of this approach is that if the neural network classifies an input song incorrectly, the extracted feature vector will not be a good representation of the input song. Thus, a song that is in fact really similar to the input song may have a small cosine similarity value, i.e. not be recommended. In conclusion, if the first classification is wrong, future inferences based on that is going to make it deviate further from the true answer. A possible future improvement will be how to offset this inference error.<br />
# In the tables when comparing performance and accuracies of the CNN and CRNN models on different genres of music, the researchers claimed that CRNN had superior performance to CNN models. This seemed intuitive, especially in the cases when the differences in accuracies were large. However, maybe the researchers should consider including some hypothesis testing statistics in such tables, which would support such claims in a more rigorous manner.<br />
# A music recommender system that doesn't use the song's meta data such as artist and genre and rather tries to classify genre itself seems unproductive. I also believe that the specific artist matters much more than the genre since within a genre you have many different styles. It just seems like the authors hamstring their recommender system by excluding other relevant data.<br />
# The genres that are posed in the paper are very broad and may not be specific enough to distinguish a listeners actual tastes (ie, I like rock and roll, but not punk rock, which could both be in the "rock" category). It would be interesting to run similar experiments with more concrete and specific genres to study the possibility of improving accuracy in the model.<br />
# This summary is well organized with detailed explanation to the music recommendation algorithm. However, since the data used in this paper is cleaned to buffer the efficiency of the recommendation, there should be a section evaluating the impact of noise on the performance this algorithm and how to minimize the impact.<br />
# This method will be better if the user choose some certain music genres that they like while doing the sign-up process. This is similar to recommending articles on twitter.<br />
# I have some feedback for the "Evaluation of Music Recommendation System" section. Firstly, there can be a brief mention of the participants' background information. Secondly, the summary mentions that "participants choose 5 pieces of music they enjoyed". Are they free to choose any music they like, or are they choosing from a pool of selections? What are the lengths of these music pieces? Lastly, method one and method two are compared against each other. It's intuitive that method two will outperform method one, since method two makes use of both cosine similarity and information on music genre, whereas method one only makes use of cosine similarity. Thus, saying method two outperforms method one is not necessarily surprising. I would like to see more explanation on why these methods are chosen, and why comparing them directly is considered to be fair.<br />
# It would be better to have more comparison with other existing music recommender system.<br />
# In the Collecting Music Data section, the author has indicated that for maintaining the balance of data for each genre that they are choosing to omit some genres and a portion of the dataset. However, how this was done was not explained explicitly which can be a concern for results replication. It would be better to describe the steps and measures taken to ensure the actions taken by the teams are reproducible. <br />
# For cleaning data, for training purposes, the team is choosing to omit the ones with lower music quality. While this is a sound option, it can be adjusted that the ratings for the music are deducted to adjust the balance. This could be important since a poor music quality could mean either equipment failure or corrupt server storage or it was a recording of a live performance that often does not have a perfect studio quality yet it would be loved by many real-life users. This omission is not entirely justified and feels like a deliberate adjustment for later results.<br />
# It would be more convincing if the author could provide more comparison between CRNN and CNN.<br />
# How is the result used to recommend songs within genres? It looks like it only predicts what genre the user likes to listen and recommends one of the songs from that genre. How can this recommender system be used to recommend songs within the same genre?<br />
# This [https://arxiv.org/pdf/2006.15795.pdf paper] implements CRNN differently; the CNN and RNN are separate and their resulting matrices and combined later. Would using this version of the CRNN potentially improve the accuracy?<br />
# This kind of approach can be used in implementing other recommender systems for, like movies, articles, news, websites etc. It would be helpful if the author could explain and generalize the implementation on other forms of recommender systems.<br />
# The accuracy of the genre classifier seemed really low, considering how distinct the genres sound to humans. The authors recommend adding features to the data but these could likely be extracted from the audio signal. Extra preprocessing would likely go a long way to improve the accuracy.<br />
# Since it was mentioned that different genres were used, it would be interesting to know if the model can classify different languages and how it performs with songs in different languages.<br />
# It is possible to extend this application to classifying baroque, classical, and romantic genre music. This can be beneficial for students (and frankly, people of all ages) who are learning about music. What's even more interesting to see is if this algorithm can distinguish music pieces written by classical musicians such as Beethoven, Haydn, and Mozart. Of course, it would take more effort in distinguishing features across the music pieces of these three artists, but it's an area worth exploring.<br />
# In contrast to the mel spectrogram method, the popular streaming app Spotify allows you to view data they collect that includes features about the nature of the of song such as acousticness, danceability, loudness, tempo, etc.<br />
# The authors introduced a good recommendation system. It might be helpful to evaluate just based on how similar users are. Meanwhile, there might be some bias need to be considered in the dataset. PCA might be a good way to analyze the existing pattern.<br />
# This models needs a way to introduce variance into its recommendations. The kind of music you wnat to listen to is largely dependent on non quantifiable factors like mood, but there is evidence to support that it depends on current events/activities. A way to introduce those factors into the recommendation system need to be looked at.<br />
<br />
== References: ==<br />
Nilashi, M., et.al. ''Collaborative Filtering Recommender Systems''. Research Journal of Applied Sciences, Engineering and Technology 5(16):4168-4182, 2013.<br />
Adiyansjah, Alexander A S Gunawan, Derwin Suhartono, Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks, Procedia Computer Science, https://doi.org/10.1016/j.procs.2019.08.146.</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Music_Recommender_System_Based_using_CRNN&diff=49676Music Recommender System Based using CRNN2020-12-07T03:18:36Z<p>X277lu: /* Critiques/ Insights: */</p>
<hr />
<div>==Introduction and Objective:==<br />
<br />
In the digital era of music streaming, companies, such as Spotify and Pandora, are faced with the following challenge: can they provide users with relevant and personalized music recommendations amidst the ever-growing abundance of music and user data?<br />
<br />
The objective of this paper is to implement a personalized music recommender system that takes user listening history as input and continually finds new music that captures individual user preferences.<br />
<br />
This paper argues that a music recommendation system should vary from the general recommendation system used in practice since it should combine music feature recognition and audio processing technologies to extract music features, and combine them with data on user preferences.<br />
<br />
The authors of this paper took a content-based music approach to build the recommendation system - specifically, comparing the similarity of features based on the audio signal.<br />
<br />
The following two-method approach for building the recommendation system was followed:<br />
#Make recommendations including genre information extracted from classification algorithms.<br />
#Make recommendations without genre information.<br />
<br />
The authors used convolutional recurrent neural networks (CRNN), which is a combination of 2D convolutional neural networks (CNN) and recurrent neural network(RNN), as their main classification model.<br />
<br />
==Methods and Techniques:==<br />
Generally, a music recommender can be divided into three main parts: (i) users, (ii) items, and (iii) user-item matching algorithms. Firstly, a model for a user's music taste is generated based on their profiles. Secondly, item profiling based on editorial, cultural, and acoustic metadata is exploited to increase listener satisfaction. Thirdly, a matching algorithm is employed to recommend personalized music to the listener. Two main approaches are currently available;<br />
<br />
1. Collaborative filtering<br />
<br />
It is based on users' historical listening data and depends on user ratings. Nearest neighbour is the standard method used for collaborative filtering and can be broken into two classes of methods: (i) user-based neighbourhood methods and (ii) item-based neighbourhood methods. <br />
<br />
User-based neighbourhood methods calculate the similarity between the target user and other users, and selects the k most similar. A weighted average of the most similar users' song ratings is then computed to predict how the target user would rate those songs. Songs that have a high predicted rating are then recommended to the user. In contrast, methods that use item-based neighbourhoods calculate similarities between songs that the target user has rated well and songs they have not listened to in order to recommend songs.<br />
<br />
That being said, collaborative filtering faces many challenges. For example, given that each user sees only a small portion of all music libraries, sparsity and scalability become an issue. However, this can be dealt with using matrix factorization. A more difficult challenge to overcome is the fact that users often don't rate songs when they are listening to music. <br />
<br />
2. Content-based filtering<br />
<br />
Content based recommendation systems base their recommendations on the similarity of an items features and features that the user has enjoyed. It has two-steps; (i) Extract audio content features and (ii) predict user preferences.<br />
<br />
However content-based filtering has to overcome the challenge of only being able to predict based on users' existing interests. The model is unable to effectively scale to a user's ever changing music taste.<br />
<br />
In this work, the authors take a content-based approach, as they compare the similarity of audio signal features to make recommendations. To classify music, the original music’s audio signal is converted into a spectrogram image. Using the image and the Short Time Fourier Transform (STFT), we convert the data into the Mel scale which is used in the CNN and CRNN models. <br />
=== Mel Scale: === <br />
The scale of pitches that are heard by listeners, which translates to equal pitch increments.<br />
<br />
[[File:Mel.png|frame|none|Mel Scale on Spectrogram]]<br />
<br />
=== Short Time Fourier Transform (STFT): ===<br />
The transformation that determines the sinusoidal frequency of the audio, with a Hanning smoothing function. In the continuous case this is written as: <math>\mathbf{STFT}\{x(t)\}(\tau,\omega) \equiv X(\tau, \omega) = \int_{-\infty}^{\infty} x(t) w(t-\tau) e^{-i \omega t} \, d t </math><br />
<br />
where: <math>w(\tau)</math> is the Hanning smoothing function. The STFT is applied over a specified window length at a certain time allowing the frequency to represented for that given window rather than the entire signal as a typical Fourier Transform would.<br />
<br />
=== Convolutional Neural Network (CNN): ===<br />
A Convolutional Neural Network is a Neural Network that uses convolution in place of matrix multiplication for some layer calculations. By training the data, weights for inputs are updated to find the most significant data relevant to classification. These convolutional layers gather small groups of data with kernels and try to find patterns that can help find features in the overall data. The features are then used for classification. Padding is another technique used to extend the pixels on the edge of the original image to allow the kernel to more accurately capture the borderline pixels. Padding is also used if one wishes the convolved output image to have a certain size. The image on the left represents the mathematical expression of a convolution operation, while the right image demonstrates an application of a kernel on the data.<br />
<br />
[[File:Convolution.png|thumb|400px|left|Convolution Operation]]<br />
[[File:PaddingKernels.png|thumb|400px|center|Example of Padding (white 0s) and Kernels (blue square)]]<br />
<br />
=== Convolutional Recurrent Neural Network (CRNN): === <br />
The CRNN is similar to the architecture of a CNN, but with the addition of a Gated Recurrent Unit (GRU), which is a Recurrent Neural Network (RNN). An RNN is used to treat sequential data, by reusing the activation function of previous nodes to update the output. The GRU is used to store more long-term memory and will help train the early hidden layers. GRUs can be thought of as LSTMs but with a forget gate, and has fewer parameters than an LSTM. These gates are used to determine how much information from the past should be passed along onto the future. They are originally aimed to prevent the vanishing gradient problem, since deeper networks will result in smaller and smaller gradients at each layer. The GRU can choose to copy over all the information in the past, thus eliminating the risk of vanishing gradients.<br />
<br />
[[File:GRU441.png|thumb|400px|left|Gated Recurrent Unit (GRU)]]<br />
[[File:Recurrent441.png|thumb|400px|center|Diagram of General Recurrent Neural Network]]<br />
<br />
==Data Screening:==<br />
<br />
The authors of this paper used a publicly available music dataset made up of 25,000 30-second songs from the Free Music Archives which contains 16 different genres. The data is cleaned up by removing low audio quality songs, wrongly labelled genres and those that have multiple genres. To ensure a balanced dataset, only 1000 songs each from the genres of classical, electronic, folk, hip-hop, instrumental, jazz and rock were used in the final model. <br />
<br />
[[File:Data441.png|thumb|200px|none|Data sorted by music genre]]<br />
<br />
==Implementation:==<br />
<br />
=== Modeling Neural Networks ===<br />
<br />
As noted previously, both CNNs and CRNNs were used to model the data. The advantage of CRNNs is that they are able to model time sequence patterns in addition to frequency features from the spectrogram, allowing for greater identification of important features. Furthermore, feature vectors produced before the classification stage could be used to improve accuracy. <br />
<br />
In implementing the neural networks, the Mel-spectrogram data was split up into training, validation, and test sets at a ratio of 8:1:1 respectively and labelled via one-hot encoding. This made it possible for the categorical data to be labelled correctly for binary classification. As opposed to classical stochastic gradient descent, the authors opted to use binary classier and ADAM optimization to update weights in the training phase, and parameters of <math>\alpha = 0.001, \beta_1 = 0.9, \beta_2 = 0.999</math>. Binary cross-entropy was used as the loss function. <br />
Input spectrogram image are 96x1366. In both the CNN and CRNN models, the data was trained over 100 epochs with a batch size of 50 (limited computing power) and using binary cross-entropy as the loss function. Notable model specific details are below:<br />
<br />
'''CNN'''<br />
* Five convolutional layers with 3x3 kernel, stride 1, padding, batch normalization, and ReLU activation<br />
* Max pooling layers <br />
* The sigmoid function was used as the output layer<br />
<br />
'''CRNN'''<br />
* Four convolutional layers with 3x3 kernel (which construct a 2D temporal pattern - two layers of RNNs with Gated Recurrent Units), stride 1, padding, batch normalization, ReLU activation, and dropout rate 0.1<br />
* Feature maps are N x1x15 (N = number of features maps, 68 feature maps in this case) is used for RNNs.<br />
* 4 Max pooling layers for four convolutional layers with kernel ((2x2)-(3x3)-(4x4)-(4x4)) and same stride<br />
* The sigmoid function was used as the output layer<br />
<br />
The CNN and CRNN architecture is also given in the charts below.<br />
<br />
[[File:CNN441.png|thumb|800px|none|Implementation of CNN Model]]<br />
[[File:CRNN441.png|thumb|800px|none|Implementation of CRNN Model]]<br />
<br />
=== Music Recommendation System ===<br />
<br />
The recommendation system utilizes cosine similarity of the extracted features from the neural network to compute similarity. Each genre will have a song act as a centre point for each class. The final inputs of the trained neural networks will be the feature variables. The feature variables will be used in the cosine similarity to find the best recommendations. <br />
<br />
The values are between [-1,1], where larger values are songs that have similar features. When the user inputs five songs, those songs become the new inputs in the neural networks and the features are used by the cosine similarity with other music. The largest five cosine similarities are used as recommendations.<br />
[[File:Cosine441.png|frame|100px|none|Cosine Similarity]]<br />
<br />
== Evaluation Metrics ==<br />
=== Precision: ===<br />
* The proportion of True Positives with respect to the '''predicted''' positive cases (true positives and false positives)<br />
* For example, out of all the songs that the classifier '''predicted''' as Classical, how many are actually Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among those predicted to be of that certain genre<br />
<br />
=== Recall: ===<br />
* The proportion of True Positives with respect to the '''actual''' positive cases (true positives and false negatives)<br />
* For example, out of all the songs that are '''actually''' Classical, how many are correctly predicted to be Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among the correct instances of that genre<br />
<br />
=== F1-Score: ===<br />
An accuracy metric that combines the classifier’s precision and recall scores by taking the harmonic mean between the two metrics:<br />
<br />
[[File:F1441.png|frame|100px|none|F1-Score]]<br />
<br />
=== Receiver operating characteristics (ROC): ===<br />
* A graphical metric that is used to assess a classification model at different classification thresholds <br />
* In the case of a classification threshold of 0.5, this means that if <math>P(Y = k | X = x) > 0.5</math> then we classify this instance as class k<br />
* Plots the true positive rate versus false positive rate as the classification threshold is varied<br />
<br />
[[File:ROCGraph.jpg|thumb|400px|none|ROC Graph. Comparison of True Positive Rate and False Positive Rate]]<br />
<br />
=== Area Under the Curve (AUC) ===<br />
AUC is the area under the ROC in doing so, the ROC provides an aggregate measure across all possible classification thresholds.<br />
<br />
In the context of the paper: When scoring all songs as <math>Prob(Classical | X=x)</math>, it is the probability that the model ranks a random Classical song at a higher probability than a random non-Classical song.<br />
<br />
[[File:AUCGraph.jpg|thumb|400px|none|Area under the ROC curve.]]<br />
<br />
== Results ==<br />
=== Accuracy Metrics ===<br />
The table below is the accuracy metrics with the classification threshold of 0.5.<br />
<br />
[[File:TruePositiveChart.jpg|thumb|none|True Positive / False Positive Chart]]<br />
On average, CRNN outperforms CNN in true positive and false positive cases. In addition, it is very apparent that false positives are much more frequent for songs in the Instrumental genre, perhaps indicating that more pre-processing needs to be done for songs in this genre or that it should be excluded from the analysis completely since most music incorporates instrumental components.<br />
<br />
<br />
[[File:F1Chart441.jpg|thumb|400px|none|F1 Chart]]<br />
On average, CRNN outperforms CNN in F1-score. <br />
<br />
<br />
[[File:AUCChart.jpg|thumb|400px|none|AUC Chart]]<br />
On average, CRNN also outperforms CNN in AUC metric.<br />
<br />
<br />
CRNN models that consider the frequency features and time sequence patterns of songs have a better classification performance through metrics such as F1 score and AUC compared to the CNN classifier.<br />
<br />
=== Evaluation of Music Recommendation System: ===<br />
<br />
* A listening experiment was performed with 30 participants to assess user responses to given music recommendations.<br />
* Participants choose 5 pieces of music they enjoy and the recommender system generates 5 new recommendations. The participants then evaluate the recommendation by recording whether they liked or disliked the music recommendation<br />
* The recommendation system takes two approaches to the recommendation:<br />
** Method one uses only the value of cosine similarity.<br />
** Method two uses the value of cosine similarity and information on music genre.<br />
*Perform test of significance of differences in average user likes between the two methods using a t-statistic:<br />
[[File:H0441.png|frame|100px|none|Hypothesis test between method 1 and method 2]]<br />
<br />
Comparing the two methods, <math> H_0: u_1 - u_2 = 0</math>, we have <math> t_{stat} = -4.743 < -2.037 </math>, which demonstrates that the increase in average user likes with the addition of music genre information is statistically significant.<br />
<br />
== Conclusion: ==<br />
<br />
The two two main conclusions obtained from this paper:<br />
<br />
* The music genre should be a key feature to increase the predictive capabilities of the music recommendation system.<br />
<br />
* To extract the song genre from a song’s audio signals and get overall better performance, CRNN’s are superior to CNN’s as they consider frequency in features and time sequence patterns of audio signals. <br />
<br />
According to the paper, the authors suggested adding other music features like tempo gram for capturing local tempo as a way to improve the accuracy of the recommender system.<br />
<br />
== Critiques/ Insights: ==<br />
# It would be helpful if authors bench-mark their novel approach with other recommendation algorithms such as collaborative filtering to see if there is a lift in predictive capabilities.<br />
# The listening experiment used to evaluate the recommendation system only includes songs that are outputted by the model. Users may be biased if they believe all songs have come from a recommendation system. To remove bias, we suggest having 15 songs where 5 songs are recommended and 10 songs are set. With this in the user’s mind, it may remove some bias in response and give more accurate predictive capabilities. <br />
# It would be better if they go into more details about how CRNN makes it perform better than CNN, in terms of attributes of each network. Also, it might be helpful to include the motivations why CNN and CRNN are useful in this specific problem. <br />
# The methodology introduced in this paper is probably also suitable for movie recommendations. As music is presented as spectrograms (images) in a time sequence, and it is very similar to a movie. <br />
# The way of evaluation is a very interesting approach. Since it's usually not easy to evaluate the testing result when it's subjective. By listing all these evaluations' performance, the result would be more comprehensive. A practice that might reduce bias is by coming back to the participants after a couple of days and asking whether they liked the music that was recommended. Often times music "grows" on people and their opinion of a new song may change after some time has passed. <br />
# The paper lacks the comparison between the proposed algorithm and the music recommendation algorithms being used now. It will be clearer to show the superiority of this algorithm.<br />
# The GAN neural network has been proposed to enhance the performance of the neural network, so an improved result may appear after considering using GAN.<br />
# The limitation of CNN and CRNN could be that they are only able to process the spectrograms with single labels rather than multiple labels. This is far from enough for the music recommender systems in today's music industry since the edges between various genres are blurred.<br />
# Is it possible for CNN and CRNN to identify different songs? The model would be harder to train, based on my experience, the efficiency of CNN in R is not very high, which can be improved for future work.<br />
# According to the author, the recommender system is done by calculating the cosine similarity of extraction features from one music to another music. Is possible to represent it by Euclidean distance or p-norm distances?<br />
# In real-life application, most of the music software will have the ability to recommend music to the listener and ask do they like the music that was recommended. It would be a nice application by involving some new information from the listener.<br />
# Actual music listeners do not listen to one genre of music, and in fact listening to the same track or the same genre would be somewhat unusual. Could this method be used to make recommendations not on genre, but based on other categories? (Such as the theme of the lyrics, the pitch of the singer, or the date published). Would this model be able to differentiate between tracks of varying "lyric vocabulation difficulty"? Or would NLP algorithms be needed to consider lyrics?<br />
# This model can be applied to many other fields such as recommending the news in the news app, recommending things to buy in the amazon, recommending videos to watch in YOUTUBE and so on based on the user information.<br />
# Looks like for the most genres, CRNN outperforms CNN, but CNN did do better on a few genres (like Jazz), so it might be better to mix them together or might use CNN for some genres and CRNN for the rest.<br />
# Cosine similarity is used to find songs with similar patterns as the input ones from users. That is, feature variables are extracted from the trained neural network model before the classification layer, and used as the basis to find similar songs. One potential problem of this approach is that if the neural network classifies an input song incorrectly, the extracted feature vector will not be a good representation of the input song. Thus, a song that is in fact really similar to the input song may have a small cosine similarity value, i.e. not be recommended. In conclusion, if the first classification is wrong, future inferences based on that is going to make it deviate further from the true answer. A possible future improvement will be how to offset this inference error.<br />
# In the tables when comparing performance and accuracies of the CNN and CRNN models on different genres of music, the researchers claimed that CRNN had superior performance to CNN models. This seemed intuitive, especially in the cases when the differences in accuracies were large. However, maybe the researchers should consider including some hypothesis testing statistics in such tables, which would support such claims in a more rigorous manner.<br />
# A music recommender system that doesn't use the song's meta data such as artist and genre and rather tries to classify genre itself seems unproductive. I also believe that the specific artist matters much more than the genre since within a genre you have many different styles. It just seems like the authors hamstring their recommender system by excluding other relevant data.<br />
# The genres that are posed in the paper are very broad and may not be specific enough to distinguish a listeners actual tastes (ie, I like rock and roll, but not punk rock, which could both be in the "rock" category). It would be interesting to run similar experiments with more concrete and specific genres to study the possibility of improving accuracy in the model.<br />
# This summary is well organized with detailed explanation to the music recommendation algorithm. However, since the data used in this paper is cleaned to buffer the efficiency of the recommendation, there should be a section evaluating the impact of noise on the performance this algorithm and how to minimize the impact.<br />
# This method will be better if the user choose some certain music genres that they like while doing the sign-up process. This is similar to recommending articles on twitter.<br />
# I have some feedback for the "Evaluation of Music Recommendation System" section. Firstly, there can be a brief mention of the participants' background information. Secondly, the summary mentions that "participants choose 5 pieces of music they enjoyed". Are they free to choose any music they like, or are they choosing from a pool of selections? What are the lengths of these music pieces? Lastly, method one and method two are compared against each other. It's intuitive that method two will outperform method one, since method two makes use of both cosine similarity and information on music genre, whereas method one only makes use of cosine similarity. Thus, saying method two outperforms method one is not necessarily surprising. I would like to see more explanation on why these methods are chosen, and why comparing them directly is considered to be fair.<br />
# It would be better to have more comparison with other existing music recommender system.<br />
# In the Collecting Music Data section, the author has indicated that for maintaining the balance of data for each genre that they are choosing to omit some genres and a portion of the dataset. However, how this was done was not explained explicitly which can be a concern for results replication. It would be better to describe the steps and measures taken to ensure the actions taken by the teams are reproducible. <br />
# For cleaning data, for training purposes, the team is choosing to omit the ones with lower music quality. While this is a sound option, it can be adjusted that the ratings for the music are deducted to adjust the balance. This could be important since a poor music quality could mean either equipment failure or corrupt server storage or it was a recording of a live performance that often does not have a perfect studio quality yet it would be loved by many real-life users. This omission is not entirely justified and feels like a deliberate adjustment for later results.<br />
# It would be more convincing if the author could provide more comparison between CRNN and CNN.<br />
# How is the result used to recommend songs within genres? It looks like it only predicts what genre the user likes to listen and recommends one of the songs from that genre. How can this recommender system be used to recommend songs within the same genre?<br />
# This [https://arxiv.org/pdf/2006.15795.pdf paper] implements CRNN differently; the CNN and RNN are separate and their resulting matrices and combined later. Would using this version of the CRNN potentially improve the accuracy?<br />
# This kind of approach can be used in implementing other recommender systems for, like movies, articles, news, websites etc. It would be helpful if the author could explain and generalize the implementation on other forms of recommender systems.<br />
# The accuracy of the genre classifier seemed really low, considering how distinct the genres sound to humans. The authors recommend adding features to the data but these could likely be extracted from the audio signal. Extra preprocessing would likely go a long way to improve the accuracy.<br />
# Since it was mentioned that different genres were used, it would be interesting to know if the model can classify different languages and how it performs with songs in different languages.<br />
# It is possible to extend this application to classifying baroque, classical, and romantic genre music. This can be beneficial for students (and frankly, people of all ages) who are learning about music. What's even more interesting to see is if this algorithm can distinguish music pieces written by classical musicians such as Beethoven, Haydn, and Mozart. Of course, it would take more effort in distinguishing features across the music pieces of these three artists, but it's an area worth exploring.<br />
# In contrast to the mel spectrogram method, the popular streaming app Spotify allows you to view data they collect that includes features about the nature of the of song such as acousticness, danceability, loudness, tempo, etc.<br />
# The authors introduced a good recommendation system. It might be helpful to evaluate just based on how similar users are. Meanwhile, there might be some bias need to be considered in the dataset. PCA might be a good way to analyze the existing pattern.<br />
# This models needs a way to introduce variance into its recommendations. The kind of music you wnat to listen to is largely dependent on non quantifiable factors like mood, but there is evidence to support that it depends on current events/activities. A way to introduce those factors into the recommendation system need to be looked at.<br />
<br />
== References: ==<br />
Nilashi, M., et.al. ''Collaborative Filtering Recommender Systems''. Research Journal of Applied Sciences, Engineering and Technology 5(16):4168-4182, 2013.<br />
Adiyansjah, Alexander A S Gunawan, Derwin Suhartono, Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks, Procedia Computer Science, https://doi.org/10.1016/j.procs.2019.08.146.</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat&diff=49669User:Cvmustat2020-12-07T03:14:47Z<p>X277lu: /* Introduction */</p>
<hr />
<div><br />
== Combine Convolution with Recurrent Networks for Text Classification == <br />
'''Team Members''': Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea<br />
<br />
'''Date''': Week of Nov 23 <br />
<br />
== Introduction ==<br />
<br />
<br />
Text classification is the task of assigning a set of predefined categories to natural language texts. It involves learning an embedding layer which allows context-dependent classification. It is a fundamental task in Natural Language Processing (NLP) with various applications such as sentiment analysis and topic classification. A classic example involving text classification is given a set of News articles, is whether it is possible to classify the genre or subject of each article. Text classification is useful as text data is a rich source of information, but extracting insights from it directly can be difficult and time-consuming as most text data is unstructured.[1] NLP text classification can help automatically structure and analyze text quickly and cost-effectively, allowing for individuals to extract important features from the text easier than before. <br />
<br />
Text classification work mainly focuses on three topics: feature engineering, feature selection, and the use of different types of machine learning algorithms.<br />
:1. ''Feature engineering'': the most widely used feature is the bag of words feature. Some more complex functions are also designed, such as part-of-speech tags, noun phrases, and tree kernels.<br />
:2. ''Feature selection'': aims to remove noisy features and improve classification performance. The most common feature selection method is to delete stop words.<br />
:3. ''Machine learning algorithms'': usually use classifiers, such as Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM).<br />
<br />
In practice, pre-trained word embeddings and deep neural networks are used together for NLP text classification. Word embeddings are used to map the raw text data to an implicit space where the semantic relationships of the words are preserved; words with similar meaning have a similar representation. One can then feed these embeddings into deep neural networks to learn different features of the text. Convolutional neural networks can be used to determine the semantic composition of the text(the meaning), as it treats texts as a 2D matrix by concatenating the embedding of words together. It uses a 1D convolution operator to perform the feature mapping and then conducts a 1D pooling operation over the time domain for obtaining a fixed-length output feature vector, and it can capture both local and position invariant features of the text.[2] Alternatively, Recurrent Neural Networks can be used to determine the contextual meaning of each word in the text (how each word relates to one another) by treating the text as sequential data and then analyzing each word separately. [3] Previous approaches to attempt to combine these two neural networks to incorporate the advantages of both models involve streamlining the two networks, which might decrease their performance. Besides, most methods incorporating a bi-directional Recurrent Neural Network usually concatenate the forward and backward hidden states at each time step, which results in a vector that does not have the interaction information between the forward and backward hidden states.[4] The hidden state in one direction contains only the contextual meaning in that particular direction, however a word's contextual representation, intuitively, is more accurate when collected and viewed from both directions. This paper argues that the failure to observe the meaning of a word in both directions causes the loss of the true meaning of the word, especially for polysemic words (words with more than one meaning) that are context-sensitive.<br />
<br />
== Paper Key Contributions ==<br />
<br />
This paper suggests an enhanced method of text classification by proposing a new way of combining Convolutional and Recurrent Neural Networks (CRNN) involving the addition of a neural tensor layer. The proposed method maintains each network's respective strengths that are normally lost in previous combination methods. The new suggested architecture is called CRNN, which utilizes both a CNN and RNN that run in parallel on the same input sentence. CNN uses weight matrix learning and produces a 2D matrix that shows the importance of each word based on local and position-invariant features. The bidirectional RNN produces a matrix that learns each word's contextual representation; the words' importance about to with concerning the rest of the sentence. <br />
<br />
A possible limitation of traditional convolutional layers is that they extract features from the data using a non-linearity applied on affine functions. The extracted feature is calculated by multiplying the weight matrix with the input data, summing up the products, adding a scalar, then employing a non-linear function. The purpose of the function is to achieve a better representation of the input data by using more layers. However, this cannot be achieved by stacking more convolutional layers because a consequent layer will not be able to extract more complex features using the output from the previous layer. <br />
<br />
Unlike CNN, CRNN introduces a neural tensor layer on top of the RNN to obtain the fusion of bi-directional contextual information surrounding a particular word. This method combines the weight matrix and the word representations together as well as classifies the text according to certain criteria, providing the important information of each word for prediction, which helps to explain the results. The model also uses dropout and L2 regularization to prevent overfitting, which in our case means only updating parts of the filters while the other features (columns in the matrix) remain zero.<br />
<br />
== CRNN Results vs Benchmarks ==<br />
<br />
In order to benchmark the performance of the CRNN model, as well as compare it to other previous efforts, multiple datasets and classification problems were used. All of these datasets are publicly available and can be easily downloaded by any user for testing.<br />
<br />
- '''Movie Reviews:''' a sentiment analysis dataset, with two classes (positive and negative).<br />
<br />
- '''Yelp:''' a sentiment analysis dataset, with five classes. For this test, a subset of 120,000 reviews was randomly chosen from each class for a total of 600,000 reviews.<br />
<br />
- '''AG's News:''' a news categorization dataset, using only the 4 largest classes from the dataset. There are 30000 training samples and 1900 test samples.<br />
<br />
- '''20 Newsgroups:''' a news categorization dataset, again using only 4 large classes (comp, politics, rec, and religion) from the dataset.<br />
<br />
- '''Sogou News:''' a Chinese news categorization dataset, using 10 major classes as a multi-class classification and include 6500 samples randomly from each class.<br />
<br />
- '''Yahoo! Answers:''' a topic classification dataset, with 10 classes and each class contains 140000 training samples and 5000 testing samples.<br />
<br />
For the English language datasets, the initial word representations were created using the publicly available ''word2vec'' [https://code.google.com/p/word2vec/] from Google news. For the Chinese language dataset, ''jieba'' [https://github.com/fxsjy/jieba] was used to segment sentences, and then 50-dimensional word vectors were trained on Chinese ''wikipedia'' using ''word2vec''.<br />
<br />
A number of other models are run against the same data after preprocessing. Some of these models include:<br />
<br />
- '''Self-attentive LSTM:''' an LSTM model with multi-hop attention for sentence embedding.<br />
<br />
- '''RCNN:''' the RCNN's recurrent structure allows for increased depth of capture for contextual information. Less noise is introduced on account of the model's holistic structure (compared to local features).<br />
<br />
The following results are obtained:<br />
<br />
[[File:table of results.png|550px|center]]<br />
<br />
The bold results represent the best performing model for a given dataset. These results show that the CRNN model is the best model for 4 of the 6 datasets, with the Self-attentive LSTM beating the CRNN by 0.03 and 0.12 on the news categorization problems. Considering that the CRNN model has better performance than the Self-attentive LSTM on the other 4 datasets, this suggests that the CRNN model is a better performer overall in the conditions of this benchmark.<br />
<br />
It should be noted that including the neural tensor layer in the CRNN model leads to a significant performance boost compared to the CRNN models without it. The performance boost can be attributed to the fact that the neural tensor layer captures the surrounding contextual information for each word, and brings this information between the forward and backward RNN in a direct method. As seen in the table, this leads to a better classification accuracy across all datasets.<br />
<br />
Another important result was that the CRNN model filter size impacted performance only in the sentiment analysis datasets, as seen in the following table, where the results for the AG's news, Yelp, and Yahoo! Answer datasets are displayed:<br />
<br />
[[File:filter_effects.png|550px|center]]<br />
<br />
== CRNN Model Architecture ==<br />
<br />
The CRNN model is a combination of RNN and CNN. It uses a CNN to compute the importance of each word in the text and utilizes a neural tensor layer to fuse forward and backward hidden states of bi-directional RNN.<br />
<br />
The input of the network is a text, which is a sequence of words. The output of the network is the text representation that is subsequently used as input of a fully-connected layer to obtain the class prediction.<br />
<br />
'''RNN Pipeline:'''<br />
<br />
The goal of the RNN pipeline is to input each word in a text, and retrieve the contextual information surrounding the word and compute the contextual representation of the word itself. This is accomplished by the use of a bi-directional RNN, such that a Neural Tensor Layer (NTL) can combine the results of the RNN to obtain the final output. RNNs are well-suited to NLP tasks because of their ability to sequentially process data such as ordered text.<br />
<br />
A RNN is similar to a feed-forward neural network, but it relies on the use of hidden states. Hidden states are layers in the neural net that produce two outputs: <math> \hat{y}_{t} </math> and <math> h_t </math>. For a time step <math> t </math>, <math> h_t </math> is fed back into the layer to compute <math> \hat{y}_{t+1} </math> and <math> h_{t+1} </math>. <br />
<br />
Traditional RNNs are only able to remember the most recent words in a sequence, which may be problematic since words that came at the beginning of the sequence that is important to the classification problem may be forgotten. This is because the error signal needs to travel backward through the whole pathway in backpropagation. Since the activations in the backward pass are entirely linear, these activation functions will "explode" if its largest eigenvalue is greater than one and "vanish" when its largest eigenvalue is less than one, which describes the problem of vanishing/exploding gradients (Grosse). The pipeline will actually use a variant of RNN called GRU, short for Gated Recurrent Units. This is done to address the vanishing gradient problem which causes the network to struggle to memorize words that came earlier in the sequence. A GRU attempts to solve this by controlling the flow of information through the network using update and reset gates. <br />
<br />
Let <math>h_{t-1} \in \mathbb{R}^m, x_t \in \mathbb{R}^d </math> be the inputs, and let <math>\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h \in \mathbb{R}^{m \times d}, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h \in \mathbb{R}^{m \times m}</math> be trainable weight matrices. Then the following equations describe the update and reset gates:<br />
<br />
<br />
<math><br />
z_t = \sigma(\mathbf{W}_zx_t + \mathbf{U}_zh_{t-1}) \text{update gate} \\<br />
r_t = \sigma(\mathbf{W}_rx_t + \mathbf{U}_rh_{t-1}) \text{reset gate} \\<br />
\tilde{h}_t = \text{tanh}(\mathbf{W}_hx_t + r_t \circ \mathbf{U}_hh_{t-1}) \text{new memory} \\<br />
h_t = (1-z_t)\circ \tilde{h}_t + z_t\circ h_{t-1}<br />
</math><br />
<br />
<br />
Note that <math> \sigma, \text{tanh}, \circ </math> are all element-wise functions. The above equations do the following:<br />
<br />
<ol><br />
<li> <math>h_{t-1}</math> carries information from the previous iteration and <math>x_t</math> is the current input </li><br />
<li> the update gate <math>z_t</math> controls how much past information should be forwarded to the next hidden state </li><br />
<li> the rest gate <math>r_t</math> controls how much past information is forgotten or reset </li><br />
<li> new memory <math>\tilde{h}_t</math> contains the relevant past memory as instructed by <math>r_t</math> and current information from the input <math>x_t</math> </li><br />
<li> then <math>z_t</math> is used to control what is passed on from <math>h_{t-1}</math> and <math>(1-z_t)</math> controls the new memory that is passed on<br />
</ol><br />
<br />
We treat <math>h_0</math> and <math> h_{n+1} </math> as zero vectors in the method. Thus, each <math>h_t</math> can be computed as above to yield results for the bi-directional RNN. The resulting hidden states <math>\overrightarrow{h_t}</math> and <math>\overleftarrow{h_t}</math> contain contextual information around the <math> t</math>-th word in forward and backward directions respectively. Contrary to convention, instead of concatenating these two vectors, it is argued that the word's contextual representation is more precise when the context information from different directions is collected and fused using a neural tensor layer as it permits greater interactions among each element of hidden states. Using these two vectors as input to the neural tensor layer, <math>V^i </math>, we compute a new representation that aggregates meanings from the forward and backward hidden states more accurately as follows:<br />
<br />
<math> <br />
[\hat{h_t}]_i = tanh(\overrightarrow{h_t}V^i\overleftarrow{h_t} + b_i) <br />
</math><br />
<br />
Where <math>V^i \in \mathbb{R}^{m \times m} </math> is the learned tensor layer, and <math> b_i \in \mathbb{R} </math> is the bias.We repeat this <math> m </math> times with different <math>V^i </math> matrices and <math> b_i </math> vectors. Through the neural tensor layer, each element in <math> [\hat{h_t}]_i </math> can be viewed as a different type of intersection between the forward and backward hidden states. In the model, <math> [\hat{h_t}]_i </math> will have the same size as the forward and backward hidden states. We then concatenate the three hidden states vectors to form a new vector that summarizes the context information :<br />
<math><br />
\overleftrightarrow{h_t} = [\overrightarrow{h_t}^T,\overleftarrow{h_t}^T,\hat{h_t}]^T <br />
</math><br />
<br />
We calculate this vector for every word in the text and then stack them all into matrix <math> H </math> with shape <math>n</math>-by-<math>3m</math>.<br />
<br />
<math><br />
H = [\overleftrightarrow{h_1};...\overleftrightarrow{h_n}]<br />
</math><br />
<br />
This <math>H</math> matrix is then forwarded as the results from the Recurrent Neural Network.<br />
<br />
<br />
'''CNN Pipeline:'''<br />
<br />
The goal of the CNN pipeline is to learn the relative importance of words in an input sequence based on different aspects. The process of this CNN pipeline is summarized as the following steps:<br />
<br />
<ol><br />
<li> Given a sequence of words, each word is converted into a word vector using the word2vec algorithm which gives matrix X. <br />
</li><br />
<br />
<li> Word vectors are then convolved through the temporal dimension with filters of various sizes (ie. different K) with learnable weights to capture various numerical K-gram representations. These K-gram representations are stored in matrix C.<br />
</li><br />
<br />
<ul><br />
<li> The convolution makes this process capture local and position-invariant features. Local means the K words are contiguous. Position-invariant means K contiguous words at any position are detected in this case via convolution.<br />
<br />
<li> Temporal dimension example: convolve words from 1 to K, then convolve words 2 to K+1, etc<br />
</li><br />
</ul><br />
<br />
<li> Since not all K-gram representations are equally meaningful, there is a learnable matrix W which takes the linear combination of K-gram representations to more heavily weigh the more important K-gram representations for the classification task.<br />
</li><br />
<br />
<li> Each linear combination of the K-gram representations gives the relative word importance based on the aspect that the linear combination encodes.<br />
</li><br />
<br />
<li> The relative word importance vs aspect gives rise to an interpretable attention matrix A, where each element says the relative importance of a specific word for a specific aspect.<br />
</li><br />
<br />
</ol><br />
<br />
[[File:Group12_Figure1.png |center]]<br />
<br />
<div align="center">Figure 1: The architecture of CRNN.</div><br />
<br />
== Merging RNN & CNN Pipeline Outputs ==<br />
<br />
The results from both the RNN and CNN pipeline can be merged by simply multiplying the output matrices. That is, we compute <math>S=A^TH</math> which has shape <math>z \times 3m</math> and is essentially a linear combination of the hidden states. The concatenated rows of S results in a vector in <math>\mathbb{R}^{3zm}</math> and can be passed to a fully connected Softmax layer to output a vector of probabilities for our classification task. <br />
<br />
To train the model, we make the following decisions:<br />
<ul><br />
<li> Use cross-entropy loss as the loss function (A cross-entropy loss function usually takes in two distributions, a true distribution p and an estimated distribution q, and measures the average number of bits need to identify an event. This calculation is independent of the kind of layers used in the network as well as the kind of activation being implemented.) </li><br />
<li> Perform dropout on random columns in matrix C in the CNN pipeline </li><br />
<li> Perform L2 regularization on all parameters </li><br />
<li> Use stochastic gradient descent with a learning rate of 0.001 </li><br />
</ul><br />
<br />
== Interpreting Learned CRNN Weights ==<br />
<br />
Recall that attention matrix A essentially stores the relative importance of every word in the input sequence for every aspect chosen. Naturally, this means that A is an n-by-z matrix, with n being the number of words in the input sequence and z being the number of aspects considered in the classification task. <br />
<br />
Furthermore, for any specific aspect, words with higher attention values are more important relative to other words in the same input sequence. likewise, for any specific word, aspects with higher attention values prioritize the specific word more than other aspects.<br />
<br />
For example, in this paper, a sentence is sampled from the Movie Reviews dataset, and the transpose of attention matrix A is visualized. Each word represents an element in matrix A, the intensity of red represents the magnitude of an attention value in A, and each sentence is the relative importance of each word for a specific context. In the first row, the words are weighted in terms of a positive aspect, in the last row, the words are weighted in terms of a negative aspect, and in the middle row, the words are weighted in terms of a positive and negative aspect. Notice how the relative importance of words is a function of the aspect.<br />
<br />
[[File:Interpretation example.png|800px|center]]<br />
<br />
From the above sample, it is interesting that the word "but" is viewed as a negative aspect. From a linguistic perspective, the semantic of "but" is incredibly difficult to capture because of the degree of contextual information it needs. In this case, "but" is in the middle of a transition from a negative to a positive so the first row should also have given attention to that word. Also, it seems that the model has learned to give very high attention to the two words directly adjacent to the word of high attention: "is" and "and" beside "powerful", and "an" and "cast" beside "unwieldy".<br />
<br />
The paper also shows that the model can determine important words in the news. The authors take Ag's news dataset and randomly select 10 important words for each class. The table (shown below) contains eye-catching words which fit their classes well.<br />
<br />
[[File:news_words.png|480px|center]]<br />
<br />
== Conclusion & Summary ==<br />
<br />
This paper proposed a new architecture, the Convolutional Recurrent Neural Network, for text classification. The Convolutional Neural Network is used to learn the relative importance of each word from their different aspects and stores this information into a weight matrix. The Recurrent Neural Network learns each word's contextual representation through the combination of the forward and backward context information that is fused using a neural tensor layer and is stored as a matrix. These two matrices are then combined to get the text representation used for classification. Although the specifics of the performed tests are lacking, the experiment's results indicate that the proposed method performed well in comparison to most previous methods. In addition to performing well, the proposed method also provides insight into which words contribute greatly to the classification decision as to the learned matrix from the Convolutional Neural Network stores the relative importance of each word. This information can then be used in other applications or analyses. In the future, one can explore the features extracted from the model and use them to potentially learn new methods such as model space. [5]<br />
<br />
== Critiques ==<br />
<br />
1. In the '''Method''' section of the paper, some explanations used the same notation for multiple different elements of the model. This made the paper harder to follow and understand since they were referring to different elements by identical notation. Additionally, the decision to use sigmoid and hyperbolic tangent functions as nonlinearities for representation learning is not supported with evidence that these are optimal.<br />
<br />
2. In '''Comparison of Methods''', the authors discuss the range of hyperparameter settings that they search through. While some of the hyperparameters have a large range of search values, three parameters are fixed without much explanation as to why for all experiments, size of the hidden state of GRU, number of layers, and dropout. These parameters have a lot to do with the complexity of the model and this paper could be improved by providing relevant reasoning behind these values, or by providing additional experimental results over different values of these parameters.<br />
<br />
3. In the '''Results''' section of the paper, they tried to show that the classification results from the CRNN model can be better interpreted than other models. In these explanations, the details were lacking and the authors did not adequately demonstrate how their model is better than others.<br />
<br />
4. Finally, in the '''Results''' section again, the paper compares the CRNN model to several models which they did not implement and reproduce results with. This can be seen in the chart of results above, where several models do not have entries in the table for all six datasets. Since the authors used a subset of the datasets, these other models which were not reproduced could have different accuracy scores if they had been tested on the same data as the CRNN model. This difference in training and testing data is not mentioned in the paper, and the conclusion that the CRNN model is better in all cases may not be valid.<br />
<br />
5. Considering the general methodology, the author in the paper chose to fuse CNN with Gated recurrent unit (GRU), which is only one version of RNN. However, it has been shown that LSTM generally performs better than GRU, and the author should discuss their choice of using GRU in CRNN instead of LSTM in more detail. Looking at the authors' experimental results, LSTM even outperforms CRNN (with GRU) in some cases, which further motivates the idea of adopting LSTM for RNN to combine with CNN. Whether combining LSTM with CNN will lead to a better performance will of course need further verification, but in principle author should at least address the issue. <br />
<br />
6. This is an interesting method, I would be curious to see if this can be combined or compared with Quasi-Recurrent Neural Networks (https://arxiv.org/abs/1611.01576). In my experience, QRNNs perform similarly to LSTMs while running significantly faster using convolutions with a special temporal pooling. This seems compatible with the neural tensor layer proposed in this paper, which may be combined to yield stronger performance with faster runtimes.<br />
<br />
7. It would be interesting to see how the attention matrix is being constructed and how attention values are being determined in each matrix. For instance, does every different subject have its own attention matrix? If so, how will the situation be handled when the same attention matrix is used in different settings?<br />
<br />
8. The paper shows the CRNN model not performing the best with Ag's news and 20newsgroups. It would be interesting to investigate this in detail and see the difference in the way the data is handled in the model compared to the best performing model(self-attentive LSTM in both datasets).<br />
<br />
9. From the Interpreting Learned CRNN Weights part, the samples are labeled as positive and negative, and their words all have opposite emotional polarities. It can be observed that regardless of whether the polarity of the example is positive or negative, the keyword can be extracted by this method, reflecting that it can capture multiple semantically meaningful components. At the same time, it will be very interesting to see if this method is applicable to other specific categories.<br />
<br />
10. The authors of this paper provide 2 examples of what topic classification is, but do not provide any explicit examples of "polysemic words whose meanings are context-sensitive", one of their main critiques of current methods. This is an opportunity to promote the use of their method and engage and inform the reader, simply by listing examples of these words.<br />
<br />
11. In the "Interpreting Learned CRNN Weights" parts, authors gave a figure but did not explain whether this is an example of positive case or negative case. I suggest the author to label the figure. Also, in the paper, we can see there are 2 figures and both of the figures have been labelled as positive or negative but there is only one figure without labelled in the summary.<br />
<br />
12. In the CRNN Results vs Benchmarks, it would be better to provide more details and comparison of other methods other than CRNN. It would be clear and more intuitive for the readers why the author will choose CRNN rather than the others.<br />
<br />
13. The author should explore more on why labeling from certain data source is much more accurate than other data source.<br />
<br />
14. Nice visual illustration for explaining the whole process. More explanations can be done on how two neural networks are combined. Would the result be different if two neural networks were modelled sequentially?<br />
<br />
15. The author could provide more information on the training process. For instance, with the given configuration and learning rate, how long did it take for the model's training accuracy to converge. Also, it would be interesting to see the effects of using other types of units like LSTM.<br />
<br />
16. It would be more understandable if the authors can provide more details (maybe some visual results) of how to merge the 2 algorithms together!<br />
<br />
== Comments ==<br />
<br />
- Could this be applied to ancient languages such as hieroglyphs to decipher/better understand them?<br />
<br />
- Another application for CRNN might be classifying spoken language as well as any form of audio.<br />
<br />
-I think it will be better to show more results by using this method. Maybe it will be better to put the result part after the architecture part? Writing a motivation will be better since it will catch readers' "eyes". I think it will be interesting to ask: whether can we apply this to ancient Chinese poetry? Since there are lots of types of ancient Chinese poetry, doing a classification for them will be interesting.<br />
<br />
- In another [https://www.aclweb.org/anthology/W99-0908/ paper] written by Andrew McCallum and Kamal Nigam, they introduce a different method of text classification. Namely, instead of a combination of recurrent and convolutional neural networks, they instead utilized bootstrapping with keywords, Expectation-Maximization algorithm, and shrinkage.<br />
<br />
- The author could add more comparison of text classification with other algorithms that have the same functionalities such as compare CRNN with clustering of text to do classification.<br />
<br />
== References ==<br />
----<br />
<br />
[1] Grimes, Seth. “Unstructured Data and the 80 Percent Rule.” Breakthrough Analysis, 1 Aug. 2008, breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.<br />
<br />
[2] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modeling sentences,”<br />
arXiv preprint arXiv:1404.2188, 2014.<br />
<br />
[3] K. Cho, B. V. Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning<br />
phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint<br />
arXiv:1406.1078, 2014.<br />
<br />
[4] Keren, G., &amp; Schuller, B. (2016, February 18). Convolutional RNN: An Enhanced Model for Extracting Features from Sequential Data. Retrieved November 30, 2020, from https://arxiv.org/pdf/1602.05875.pdf.<br />
<br />
[5] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Proceedings<br />
of AAAI, 2015, pp. 2267–2273.<br />
<br />
[6] H. Chen, P. Tio, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE<br />
Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 124–136, 2014.<br />
<br />
[7] Grosse, R. Lecture 15: Exploding and Vanishing Gradients. Lecture presented at CSC 321 in University of Toronto, Toronto. Retrieved from http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/readings/L15 Exploding and Vanishing Gradients.pdf&usg=AOvVaw0tbIdxkUR1WsUtyIf4HVSn</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat&diff=49668User:Cvmustat2020-12-07T03:14:08Z<p>X277lu: /* Introduction */</p>
<hr />
<div><br />
== Combine Convolution with Recurrent Networks for Text Classification == <br />
'''Team Members''': Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea<br />
<br />
'''Date''': Week of Nov 23 <br />
<br />
== Introduction ==<br />
<br />
<br />
Text classification is the task of assigning a set of predefined categories to natural language texts. It involves learning an embedding layer which allows context-dependent classification. It is a fundamental task in Natural Language Processing (NLP) with various applications such as sentiment analysis and topic classification. A classic example involving text classification is given a set of News articles, is whether it is possible to classify the genre or subject of each article. Text classification is useful as text data is a rich source of information, but extracting insights from it directly can be difficult and time-consuming as most text data is unstructured.[1] NLP text classification can help automatically structure and analyze text quickly and cost-effectively, allowing for individuals to extract important features from the text easier than before. <br />
<br />
Text classification work mainly focuses on three topics: feature engineering, feature selection, and the use of different types of machine learning algorithms.<br />
:1. [[Feature engineering]]: the most widely used feature is the bag of words feature. Some more complex functions are also designed, such as part-of-speech tags, noun phrases, and tree kernels.<br />
:2. [[Feature selection]]: aims to remove noisy features and improve classification performance. The most common feature selection method is to delete stop words.<br />
:3. [[Machine learning algorithms]]: usually use classifiers, such as Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM).<br />
<br />
In practice, pre-trained word embeddings and deep neural networks are used together for NLP text classification. Word embeddings are used to map the raw text data to an implicit space where the semantic relationships of the words are preserved; words with similar meaning have a similar representation. One can then feed these embeddings into deep neural networks to learn different features of the text. Convolutional neural networks can be used to determine the semantic composition of the text(the meaning), as it treats texts as a 2D matrix by concatenating the embedding of words together. It uses a 1D convolution operator to perform the feature mapping and then conducts a 1D pooling operation over the time domain for obtaining a fixed-length output feature vector, and it can capture both local and position invariant features of the text.[2] Alternatively, Recurrent Neural Networks can be used to determine the contextual meaning of each word in the text (how each word relates to one another) by treating the text as sequential data and then analyzing each word separately. [3] Previous approaches to attempt to combine these two neural networks to incorporate the advantages of both models involve streamlining the two networks, which might decrease their performance. Besides, most methods incorporating a bi-directional Recurrent Neural Network usually concatenate the forward and backward hidden states at each time step, which results in a vector that does not have the interaction information between the forward and backward hidden states.[4] The hidden state in one direction contains only the contextual meaning in that particular direction, however a word's contextual representation, intuitively, is more accurate when collected and viewed from both directions. This paper argues that the failure to observe the meaning of a word in both directions causes the loss of the true meaning of the word, especially for polysemic words (words with more than one meaning) that are context-sensitive.<br />
<br />
== Paper Key Contributions ==<br />
<br />
This paper suggests an enhanced method of text classification by proposing a new way of combining Convolutional and Recurrent Neural Networks (CRNN) involving the addition of a neural tensor layer. The proposed method maintains each network's respective strengths that are normally lost in previous combination methods. The new suggested architecture is called CRNN, which utilizes both a CNN and RNN that run in parallel on the same input sentence. CNN uses weight matrix learning and produces a 2D matrix that shows the importance of each word based on local and position-invariant features. The bidirectional RNN produces a matrix that learns each word's contextual representation; the words' importance about to with concerning the rest of the sentence. <br />
<br />
A possible limitation of traditional convolutional layers is that they extract features from the data using a non-linearity applied on affine functions. The extracted feature is calculated by multiplying the weight matrix with the input data, summing up the products, adding a scalar, then employing a non-linear function. The purpose of the function is to achieve a better representation of the input data by using more layers. However, this cannot be achieved by stacking more convolutional layers because a consequent layer will not be able to extract more complex features using the output from the previous layer. <br />
<br />
Unlike CNN, CRNN introduces a neural tensor layer on top of the RNN to obtain the fusion of bi-directional contextual information surrounding a particular word. This method combines the weight matrix and the word representations together as well as classifies the text according to certain criteria, providing the important information of each word for prediction, which helps to explain the results. The model also uses dropout and L2 regularization to prevent overfitting, which in our case means only updating parts of the filters while the other features (columns in the matrix) remain zero.<br />
<br />
== CRNN Results vs Benchmarks ==<br />
<br />
In order to benchmark the performance of the CRNN model, as well as compare it to other previous efforts, multiple datasets and classification problems were used. All of these datasets are publicly available and can be easily downloaded by any user for testing.<br />
<br />
- '''Movie Reviews:''' a sentiment analysis dataset, with two classes (positive and negative).<br />
<br />
- '''Yelp:''' a sentiment analysis dataset, with five classes. For this test, a subset of 120,000 reviews was randomly chosen from each class for a total of 600,000 reviews.<br />
<br />
- '''AG's News:''' a news categorization dataset, using only the 4 largest classes from the dataset. There are 30000 training samples and 1900 test samples.<br />
<br />
- '''20 Newsgroups:''' a news categorization dataset, again using only 4 large classes (comp, politics, rec, and religion) from the dataset.<br />
<br />
- '''Sogou News:''' a Chinese news categorization dataset, using 10 major classes as a multi-class classification and include 6500 samples randomly from each class.<br />
<br />
- '''Yahoo! Answers:''' a topic classification dataset, with 10 classes and each class contains 140000 training samples and 5000 testing samples.<br />
<br />
For the English language datasets, the initial word representations were created using the publicly available ''word2vec'' [https://code.google.com/p/word2vec/] from Google news. For the Chinese language dataset, ''jieba'' [https://github.com/fxsjy/jieba] was used to segment sentences, and then 50-dimensional word vectors were trained on Chinese ''wikipedia'' using ''word2vec''.<br />
<br />
A number of other models are run against the same data after preprocessing. Some of these models include:<br />
<br />
- '''Self-attentive LSTM:''' an LSTM model with multi-hop attention for sentence embedding.<br />
<br />
- '''RCNN:''' the RCNN's recurrent structure allows for increased depth of capture for contextual information. Less noise is introduced on account of the model's holistic structure (compared to local features).<br />
<br />
The following results are obtained:<br />
<br />
[[File:table of results.png|550px|center]]<br />
<br />
The bold results represent the best performing model for a given dataset. These results show that the CRNN model is the best model for 4 of the 6 datasets, with the Self-attentive LSTM beating the CRNN by 0.03 and 0.12 on the news categorization problems. Considering that the CRNN model has better performance than the Self-attentive LSTM on the other 4 datasets, this suggests that the CRNN model is a better performer overall in the conditions of this benchmark.<br />
<br />
It should be noted that including the neural tensor layer in the CRNN model leads to a significant performance boost compared to the CRNN models without it. The performance boost can be attributed to the fact that the neural tensor layer captures the surrounding contextual information for each word, and brings this information between the forward and backward RNN in a direct method. As seen in the table, this leads to a better classification accuracy across all datasets.<br />
<br />
Another important result was that the CRNN model filter size impacted performance only in the sentiment analysis datasets, as seen in the following table, where the results for the AG's news, Yelp, and Yahoo! Answer datasets are displayed:<br />
<br />
[[File:filter_effects.png|550px|center]]<br />
<br />
== CRNN Model Architecture ==<br />
<br />
The CRNN model is a combination of RNN and CNN. It uses a CNN to compute the importance of each word in the text and utilizes a neural tensor layer to fuse forward and backward hidden states of bi-directional RNN.<br />
<br />
The input of the network is a text, which is a sequence of words. The output of the network is the text representation that is subsequently used as input of a fully-connected layer to obtain the class prediction.<br />
<br />
'''RNN Pipeline:'''<br />
<br />
The goal of the RNN pipeline is to input each word in a text, and retrieve the contextual information surrounding the word and compute the contextual representation of the word itself. This is accomplished by the use of a bi-directional RNN, such that a Neural Tensor Layer (NTL) can combine the results of the RNN to obtain the final output. RNNs are well-suited to NLP tasks because of their ability to sequentially process data such as ordered text.<br />
<br />
A RNN is similar to a feed-forward neural network, but it relies on the use of hidden states. Hidden states are layers in the neural net that produce two outputs: <math> \hat{y}_{t} </math> and <math> h_t </math>. For a time step <math> t </math>, <math> h_t </math> is fed back into the layer to compute <math> \hat{y}_{t+1} </math> and <math> h_{t+1} </math>. <br />
<br />
Traditional RNNs are only able to remember the most recent words in a sequence, which may be problematic since words that came at the beginning of the sequence that is important to the classification problem may be forgotten. This is because the error signal needs to travel backward through the whole pathway in backpropagation. Since the activations in the backward pass are entirely linear, these activation functions will "explode" if its largest eigenvalue is greater than one and "vanish" when its largest eigenvalue is less than one, which describes the problem of vanishing/exploding gradients (Grosse). The pipeline will actually use a variant of RNN called GRU, short for Gated Recurrent Units. This is done to address the vanishing gradient problem which causes the network to struggle to memorize words that came earlier in the sequence. A GRU attempts to solve this by controlling the flow of information through the network using update and reset gates. <br />
<br />
Let <math>h_{t-1} \in \mathbb{R}^m, x_t \in \mathbb{R}^d </math> be the inputs, and let <math>\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h \in \mathbb{R}^{m \times d}, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h \in \mathbb{R}^{m \times m}</math> be trainable weight matrices. Then the following equations describe the update and reset gates:<br />
<br />
<br />
<math><br />
z_t = \sigma(\mathbf{W}_zx_t + \mathbf{U}_zh_{t-1}) \text{update gate} \\<br />
r_t = \sigma(\mathbf{W}_rx_t + \mathbf{U}_rh_{t-1}) \text{reset gate} \\<br />
\tilde{h}_t = \text{tanh}(\mathbf{W}_hx_t + r_t \circ \mathbf{U}_hh_{t-1}) \text{new memory} \\<br />
h_t = (1-z_t)\circ \tilde{h}_t + z_t\circ h_{t-1}<br />
</math><br />
<br />
<br />
Note that <math> \sigma, \text{tanh}, \circ </math> are all element-wise functions. The above equations do the following:<br />
<br />
<ol><br />
<li> <math>h_{t-1}</math> carries information from the previous iteration and <math>x_t</math> is the current input </li><br />
<li> the update gate <math>z_t</math> controls how much past information should be forwarded to the next hidden state </li><br />
<li> the rest gate <math>r_t</math> controls how much past information is forgotten or reset </li><br />
<li> new memory <math>\tilde{h}_t</math> contains the relevant past memory as instructed by <math>r_t</math> and current information from the input <math>x_t</math> </li><br />
<li> then <math>z_t</math> is used to control what is passed on from <math>h_{t-1}</math> and <math>(1-z_t)</math> controls the new memory that is passed on<br />
</ol><br />
<br />
We treat <math>h_0</math> and <math> h_{n+1} </math> as zero vectors in the method. Thus, each <math>h_t</math> can be computed as above to yield results for the bi-directional RNN. The resulting hidden states <math>\overrightarrow{h_t}</math> and <math>\overleftarrow{h_t}</math> contain contextual information around the <math> t</math>-th word in forward and backward directions respectively. Contrary to convention, instead of concatenating these two vectors, it is argued that the word's contextual representation is more precise when the context information from different directions is collected and fused using a neural tensor layer as it permits greater interactions among each element of hidden states. Using these two vectors as input to the neural tensor layer, <math>V^i </math>, we compute a new representation that aggregates meanings from the forward and backward hidden states more accurately as follows:<br />
<br />
<math> <br />
[\hat{h_t}]_i = tanh(\overrightarrow{h_t}V^i\overleftarrow{h_t} + b_i) <br />
</math><br />
<br />
Where <math>V^i \in \mathbb{R}^{m \times m} </math> is the learned tensor layer, and <math> b_i \in \mathbb{R} </math> is the bias.We repeat this <math> m </math> times with different <math>V^i </math> matrices and <math> b_i </math> vectors. Through the neural tensor layer, each element in <math> [\hat{h_t}]_i </math> can be viewed as a different type of intersection between the forward and backward hidden states. In the model, <math> [\hat{h_t}]_i </math> will have the same size as the forward and backward hidden states. We then concatenate the three hidden states vectors to form a new vector that summarizes the context information :<br />
<math><br />
\overleftrightarrow{h_t} = [\overrightarrow{h_t}^T,\overleftarrow{h_t}^T,\hat{h_t}]^T <br />
</math><br />
<br />
We calculate this vector for every word in the text and then stack them all into matrix <math> H </math> with shape <math>n</math>-by-<math>3m</math>.<br />
<br />
<math><br />
H = [\overleftrightarrow{h_1};...\overleftrightarrow{h_n}]<br />
</math><br />
<br />
This <math>H</math> matrix is then forwarded as the results from the Recurrent Neural Network.<br />
<br />
<br />
'''CNN Pipeline:'''<br />
<br />
The goal of the CNN pipeline is to learn the relative importance of words in an input sequence based on different aspects. The process of this CNN pipeline is summarized as the following steps:<br />
<br />
<ol><br />
<li> Given a sequence of words, each word is converted into a word vector using the word2vec algorithm which gives matrix X. <br />
</li><br />
<br />
<li> Word vectors are then convolved through the temporal dimension with filters of various sizes (ie. different K) with learnable weights to capture various numerical K-gram representations. These K-gram representations are stored in matrix C.<br />
</li><br />
<br />
<ul><br />
<li> The convolution makes this process capture local and position-invariant features. Local means the K words are contiguous. Position-invariant means K contiguous words at any position are detected in this case via convolution.<br />
<br />
<li> Temporal dimension example: convolve words from 1 to K, then convolve words 2 to K+1, etc<br />
</li><br />
</ul><br />
<br />
<li> Since not all K-gram representations are equally meaningful, there is a learnable matrix W which takes the linear combination of K-gram representations to more heavily weigh the more important K-gram representations for the classification task.<br />
</li><br />
<br />
<li> Each linear combination of the K-gram representations gives the relative word importance based on the aspect that the linear combination encodes.<br />
</li><br />
<br />
<li> The relative word importance vs aspect gives rise to an interpretable attention matrix A, where each element says the relative importance of a specific word for a specific aspect.<br />
</li><br />
<br />
</ol><br />
<br />
[[File:Group12_Figure1.png |center]]<br />
<br />
<div align="center">Figure 1: The architecture of CRNN.</div><br />
<br />
== Merging RNN & CNN Pipeline Outputs ==<br />
<br />
The results from both the RNN and CNN pipeline can be merged by simply multiplying the output matrices. That is, we compute <math>S=A^TH</math> which has shape <math>z \times 3m</math> and is essentially a linear combination of the hidden states. The concatenated rows of S results in a vector in <math>\mathbb{R}^{3zm}</math> and can be passed to a fully connected Softmax layer to output a vector of probabilities for our classification task. <br />
<br />
To train the model, we make the following decisions:<br />
<ul><br />
<li> Use cross-entropy loss as the loss function (A cross-entropy loss function usually takes in two distributions, a true distribution p and an estimated distribution q, and measures the average number of bits need to identify an event. This calculation is independent of the kind of layers used in the network as well as the kind of activation being implemented.) </li><br />
<li> Perform dropout on random columns in matrix C in the CNN pipeline </li><br />
<li> Perform L2 regularization on all parameters </li><br />
<li> Use stochastic gradient descent with a learning rate of 0.001 </li><br />
</ul><br />
<br />
== Interpreting Learned CRNN Weights ==<br />
<br />
Recall that attention matrix A essentially stores the relative importance of every word in the input sequence for every aspect chosen. Naturally, this means that A is an n-by-z matrix, with n being the number of words in the input sequence and z being the number of aspects considered in the classification task. <br />
<br />
Furthermore, for any specific aspect, words with higher attention values are more important relative to other words in the same input sequence. likewise, for any specific word, aspects with higher attention values prioritize the specific word more than other aspects.<br />
<br />
For example, in this paper, a sentence is sampled from the Movie Reviews dataset, and the transpose of attention matrix A is visualized. Each word represents an element in matrix A, the intensity of red represents the magnitude of an attention value in A, and each sentence is the relative importance of each word for a specific context. In the first row, the words are weighted in terms of a positive aspect, in the last row, the words are weighted in terms of a negative aspect, and in the middle row, the words are weighted in terms of a positive and negative aspect. Notice how the relative importance of words is a function of the aspect.<br />
<br />
[[File:Interpretation example.png|800px|center]]<br />
<br />
From the above sample, it is interesting that the word "but" is viewed as a negative aspect. From a linguistic perspective, the semantic of "but" is incredibly difficult to capture because of the degree of contextual information it needs. In this case, "but" is in the middle of a transition from a negative to a positive so the first row should also have given attention to that word. Also, it seems that the model has learned to give very high attention to the two words directly adjacent to the word of high attention: "is" and "and" beside "powerful", and "an" and "cast" beside "unwieldy".<br />
<br />
The paper also shows that the model can determine important words in the news. The authors take Ag's news dataset and randomly select 10 important words for each class. The table (shown below) contains eye-catching words which fit their classes well.<br />
<br />
[[File:news_words.png|480px|center]]<br />
<br />
== Conclusion & Summary ==<br />
<br />
This paper proposed a new architecture, the Convolutional Recurrent Neural Network, for text classification. The Convolutional Neural Network is used to learn the relative importance of each word from their different aspects and stores this information into a weight matrix. The Recurrent Neural Network learns each word's contextual representation through the combination of the forward and backward context information that is fused using a neural tensor layer and is stored as a matrix. These two matrices are then combined to get the text representation used for classification. Although the specifics of the performed tests are lacking, the experiment's results indicate that the proposed method performed well in comparison to most previous methods. In addition to performing well, the proposed method also provides insight into which words contribute greatly to the classification decision as to the learned matrix from the Convolutional Neural Network stores the relative importance of each word. This information can then be used in other applications or analyses. In the future, one can explore the features extracted from the model and use them to potentially learn new methods such as model space. [5]<br />
<br />
== Critiques ==<br />
<br />
1. In the '''Method''' section of the paper, some explanations used the same notation for multiple different elements of the model. This made the paper harder to follow and understand since they were referring to different elements by identical notation. Additionally, the decision to use sigmoid and hyperbolic tangent functions as nonlinearities for representation learning is not supported with evidence that these are optimal.<br />
<br />
2. In '''Comparison of Methods''', the authors discuss the range of hyperparameter settings that they search through. While some of the hyperparameters have a large range of search values, three parameters are fixed without much explanation as to why for all experiments, size of the hidden state of GRU, number of layers, and dropout. These parameters have a lot to do with the complexity of the model and this paper could be improved by providing relevant reasoning behind these values, or by providing additional experimental results over different values of these parameters.<br />
<br />
3. In the '''Results''' section of the paper, they tried to show that the classification results from the CRNN model can be better interpreted than other models. In these explanations, the details were lacking and the authors did not adequately demonstrate how their model is better than others.<br />
<br />
4. Finally, in the '''Results''' section again, the paper compares the CRNN model to several models which they did not implement and reproduce results with. This can be seen in the chart of results above, where several models do not have entries in the table for all six datasets. Since the authors used a subset of the datasets, these other models which were not reproduced could have different accuracy scores if they had been tested on the same data as the CRNN model. This difference in training and testing data is not mentioned in the paper, and the conclusion that the CRNN model is better in all cases may not be valid.<br />
<br />
5. Considering the general methodology, the author in the paper chose to fuse CNN with Gated recurrent unit (GRU), which is only one version of RNN. However, it has been shown that LSTM generally performs better than GRU, and the author should discuss their choice of using GRU in CRNN instead of LSTM in more detail. Looking at the authors' experimental results, LSTM even outperforms CRNN (with GRU) in some cases, which further motivates the idea of adopting LSTM for RNN to combine with CNN. Whether combining LSTM with CNN will lead to a better performance will of course need further verification, but in principle author should at least address the issue. <br />
<br />
6. This is an interesting method, I would be curious to see if this can be combined or compared with Quasi-Recurrent Neural Networks (https://arxiv.org/abs/1611.01576). In my experience, QRNNs perform similarly to LSTMs while running significantly faster using convolutions with a special temporal pooling. This seems compatible with the neural tensor layer proposed in this paper, which may be combined to yield stronger performance with faster runtimes.<br />
<br />
7. It would be interesting to see how the attention matrix is being constructed and how attention values are being determined in each matrix. For instance, does every different subject have its own attention matrix? If so, how will the situation be handled when the same attention matrix is used in different settings?<br />
<br />
8. The paper shows the CRNN model not performing the best with Ag's news and 20newsgroups. It would be interesting to investigate this in detail and see the difference in the way the data is handled in the model compared to the best performing model(self-attentive LSTM in both datasets).<br />
<br />
9. From the Interpreting Learned CRNN Weights part, the samples are labeled as positive and negative, and their words all have opposite emotional polarities. It can be observed that regardless of whether the polarity of the example is positive or negative, the keyword can be extracted by this method, reflecting that it can capture multiple semantically meaningful components. At the same time, it will be very interesting to see if this method is applicable to other specific categories.<br />
<br />
10. The authors of this paper provide 2 examples of what topic classification is, but do not provide any explicit examples of "polysemic words whose meanings are context-sensitive", one of their main critiques of current methods. This is an opportunity to promote the use of their method and engage and inform the reader, simply by listing examples of these words.<br />
<br />
11. In the "Interpreting Learned CRNN Weights" parts, authors gave a figure but did not explain whether this is an example of positive case or negative case. I suggest the author to label the figure. Also, in the paper, we can see there are 2 figures and both of the figures have been labelled as positive or negative but there is only one figure without labelled in the summary.<br />
<br />
12. In the CRNN Results vs Benchmarks, it would be better to provide more details and comparison of other methods other than CRNN. It would be clear and more intuitive for the readers why the author will choose CRNN rather than the others.<br />
<br />
13. The author should explore more on why labeling from certain data source is much more accurate than other data source.<br />
<br />
14. Nice visual illustration for explaining the whole process. More explanations can be done on how two neural networks are combined. Would the result be different if two neural networks were modelled sequentially?<br />
<br />
15. The author could provide more information on the training process. For instance, with the given configuration and learning rate, how long did it take for the model's training accuracy to converge. Also, it would be interesting to see the effects of using other types of units like LSTM.<br />
<br />
16. It would be more understandable if the authors can provide more details (maybe some visual results) of how to merge the 2 algorithms together!<br />
<br />
== Comments ==<br />
<br />
- Could this be applied to ancient languages such as hieroglyphs to decipher/better understand them?<br />
<br />
- Another application for CRNN might be classifying spoken language as well as any form of audio.<br />
<br />
-I think it will be better to show more results by using this method. Maybe it will be better to put the result part after the architecture part? Writing a motivation will be better since it will catch readers' "eyes". I think it will be interesting to ask: whether can we apply this to ancient Chinese poetry? Since there are lots of types of ancient Chinese poetry, doing a classification for them will be interesting.<br />
<br />
- In another [https://www.aclweb.org/anthology/W99-0908/ paper] written by Andrew McCallum and Kamal Nigam, they introduce a different method of text classification. Namely, instead of a combination of recurrent and convolutional neural networks, they instead utilized bootstrapping with keywords, Expectation-Maximization algorithm, and shrinkage.<br />
<br />
- The author could add more comparison of text classification with other algorithms that have the same functionalities such as compare CRNN with clustering of text to do classification.<br />
<br />
== References ==<br />
----<br />
<br />
[1] Grimes, Seth. “Unstructured Data and the 80 Percent Rule.” Breakthrough Analysis, 1 Aug. 2008, breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.<br />
<br />
[2] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modeling sentences,”<br />
arXiv preprint arXiv:1404.2188, 2014.<br />
<br />
[3] K. Cho, B. V. Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning<br />
phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint<br />
arXiv:1406.1078, 2014.<br />
<br />
[4] Keren, G., &amp; Schuller, B. (2016, February 18). Convolutional RNN: An Enhanced Model for Extracting Features from Sequential Data. Retrieved November 30, 2020, from https://arxiv.org/pdf/1602.05875.pdf.<br />
<br />
[5] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Proceedings<br />
of AAAI, 2015, pp. 2267–2273.<br />
<br />
[6] H. Chen, P. Tio, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE<br />
Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 124–136, 2014.<br />
<br />
[7] Grosse, R. Lecture 15: Exploding and Vanishing Gradients. Lecture presented at CSC 321 in University of Toronto, Toronto. Retrieved from http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/readings/L15 Exploding and Vanishing Gradients.pdf&usg=AOvVaw0tbIdxkUR1WsUtyIf4HVSn</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat&diff=49667User:Cvmustat2020-12-07T03:13:05Z<p>X277lu: /* Critiques */</p>
<hr />
<div><br />
== Combine Convolution with Recurrent Networks for Text Classification == <br />
'''Team Members''': Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea<br />
<br />
'''Date''': Week of Nov 23 <br />
<br />
== Introduction ==<br />
<br />
<br />
Text classification is the task of assigning a set of predefined categories to natural language texts. It involves learning an embedding layer which allows context-dependent classification. It is a fundamental task in Natural Language Processing (NLP) with various applications such as sentiment analysis and topic classification. A classic example involving text classification is given a set of News articles, is whether it is possible to classify the genre or subject of each article. Text classification is useful as text data is a rich source of information, but extracting insights from it directly can be difficult and time-consuming as most text data is unstructured.[1] NLP text classification can help automatically structure and analyze text quickly and cost-effectively, allowing for individuals to extract important features from the text easier than before. <br />
<br />
Text classification work mainly focuses on three topics: feature engineering, feature selection, and the use of different types of machine learning algorithms.<br />
:1. Feature engineering, the most widely used feature is the bag of words feature. Some more complex functions are also designed, such as part-of-speech tags, noun phrases, and tree kernels.<br />
:2. Feature selection aims to remove noisy features and improve classification performance. The most common feature selection method is to delete stop words.<br />
:3. Machine learning algorithms usually use classifiers, such as Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM).<br />
<br />
In practice, pre-trained word embeddings and deep neural networks are used together for NLP text classification. Word embeddings are used to map the raw text data to an implicit space where the semantic relationships of the words are preserved; words with similar meaning have a similar representation. One can then feed these embeddings into deep neural networks to learn different features of the text. Convolutional neural networks can be used to determine the semantic composition of the text(the meaning), as it treats texts as a 2D matrix by concatenating the embedding of words together. It uses a 1D convolution operator to perform the feature mapping and then conducts a 1D pooling operation over the time domain for obtaining a fixed-length output feature vector, and it can capture both local and position invariant features of the text.[2] Alternatively, Recurrent Neural Networks can be used to determine the contextual meaning of each word in the text (how each word relates to one another) by treating the text as sequential data and then analyzing each word separately. [3] Previous approaches to attempt to combine these two neural networks to incorporate the advantages of both models involve streamlining the two networks, which might decrease their performance. Besides, most methods incorporating a bi-directional Recurrent Neural Network usually concatenate the forward and backward hidden states at each time step, which results in a vector that does not have the interaction information between the forward and backward hidden states.[4] The hidden state in one direction contains only the contextual meaning in that particular direction, however a word's contextual representation, intuitively, is more accurate when collected and viewed from both directions. This paper argues that the failure to observe the meaning of a word in both directions causes the loss of the true meaning of the word, especially for polysemic words (words with more than one meaning) that are context-sensitive.<br />
<br />
== Paper Key Contributions ==<br />
<br />
This paper suggests an enhanced method of text classification by proposing a new way of combining Convolutional and Recurrent Neural Networks (CRNN) involving the addition of a neural tensor layer. The proposed method maintains each network's respective strengths that are normally lost in previous combination methods. The new suggested architecture is called CRNN, which utilizes both a CNN and RNN that run in parallel on the same input sentence. CNN uses weight matrix learning and produces a 2D matrix that shows the importance of each word based on local and position-invariant features. The bidirectional RNN produces a matrix that learns each word's contextual representation; the words' importance about to with concerning the rest of the sentence. <br />
<br />
A possible limitation of traditional convolutional layers is that they extract features from the data using a non-linearity applied on affine functions. The extracted feature is calculated by multiplying the weight matrix with the input data, summing up the products, adding a scalar, then employing a non-linear function. The purpose of the function is to achieve a better representation of the input data by using more layers. However, this cannot be achieved by stacking more convolutional layers because a consequent layer will not be able to extract more complex features using the output from the previous layer. <br />
<br />
Unlike CNN, CRNN introduces a neural tensor layer on top of the RNN to obtain the fusion of bi-directional contextual information surrounding a particular word. This method combines the weight matrix and the word representations together as well as classifies the text according to certain criteria, providing the important information of each word for prediction, which helps to explain the results. The model also uses dropout and L2 regularization to prevent overfitting, which in our case means only updating parts of the filters while the other features (columns in the matrix) remain zero.<br />
<br />
== CRNN Results vs Benchmarks ==<br />
<br />
In order to benchmark the performance of the CRNN model, as well as compare it to other previous efforts, multiple datasets and classification problems were used. All of these datasets are publicly available and can be easily downloaded by any user for testing.<br />
<br />
- '''Movie Reviews:''' a sentiment analysis dataset, with two classes (positive and negative).<br />
<br />
- '''Yelp:''' a sentiment analysis dataset, with five classes. For this test, a subset of 120,000 reviews was randomly chosen from each class for a total of 600,000 reviews.<br />
<br />
- '''AG's News:''' a news categorization dataset, using only the 4 largest classes from the dataset. There are 30000 training samples and 1900 test samples.<br />
<br />
- '''20 Newsgroups:''' a news categorization dataset, again using only 4 large classes (comp, politics, rec, and religion) from the dataset.<br />
<br />
- '''Sogou News:''' a Chinese news categorization dataset, using 10 major classes as a multi-class classification and include 6500 samples randomly from each class.<br />
<br />
- '''Yahoo! Answers:''' a topic classification dataset, with 10 classes and each class contains 140000 training samples and 5000 testing samples.<br />
<br />
For the English language datasets, the initial word representations were created using the publicly available ''word2vec'' [https://code.google.com/p/word2vec/] from Google news. For the Chinese language dataset, ''jieba'' [https://github.com/fxsjy/jieba] was used to segment sentences, and then 50-dimensional word vectors were trained on Chinese ''wikipedia'' using ''word2vec''.<br />
<br />
A number of other models are run against the same data after preprocessing. Some of these models include:<br />
<br />
- '''Self-attentive LSTM:''' an LSTM model with multi-hop attention for sentence embedding.<br />
<br />
- '''RCNN:''' the RCNN's recurrent structure allows for increased depth of capture for contextual information. Less noise is introduced on account of the model's holistic structure (compared to local features).<br />
<br />
The following results are obtained:<br />
<br />
[[File:table of results.png|550px|center]]<br />
<br />
The bold results represent the best performing model for a given dataset. These results show that the CRNN model is the best model for 4 of the 6 datasets, with the Self-attentive LSTM beating the CRNN by 0.03 and 0.12 on the news categorization problems. Considering that the CRNN model has better performance than the Self-attentive LSTM on the other 4 datasets, this suggests that the CRNN model is a better performer overall in the conditions of this benchmark.<br />
<br />
It should be noted that including the neural tensor layer in the CRNN model leads to a significant performance boost compared to the CRNN models without it. The performance boost can be attributed to the fact that the neural tensor layer captures the surrounding contextual information for each word, and brings this information between the forward and backward RNN in a direct method. As seen in the table, this leads to a better classification accuracy across all datasets.<br />
<br />
Another important result was that the CRNN model filter size impacted performance only in the sentiment analysis datasets, as seen in the following table, where the results for the AG's news, Yelp, and Yahoo! Answer datasets are displayed:<br />
<br />
[[File:filter_effects.png|550px|center]]<br />
<br />
== CRNN Model Architecture ==<br />
<br />
The CRNN model is a combination of RNN and CNN. It uses a CNN to compute the importance of each word in the text and utilizes a neural tensor layer to fuse forward and backward hidden states of bi-directional RNN.<br />
<br />
The input of the network is a text, which is a sequence of words. The output of the network is the text representation that is subsequently used as input of a fully-connected layer to obtain the class prediction.<br />
<br />
'''RNN Pipeline:'''<br />
<br />
The goal of the RNN pipeline is to input each word in a text, and retrieve the contextual information surrounding the word and compute the contextual representation of the word itself. This is accomplished by the use of a bi-directional RNN, such that a Neural Tensor Layer (NTL) can combine the results of the RNN to obtain the final output. RNNs are well-suited to NLP tasks because of their ability to sequentially process data such as ordered text.<br />
<br />
A RNN is similar to a feed-forward neural network, but it relies on the use of hidden states. Hidden states are layers in the neural net that produce two outputs: <math> \hat{y}_{t} </math> and <math> h_t </math>. For a time step <math> t </math>, <math> h_t </math> is fed back into the layer to compute <math> \hat{y}_{t+1} </math> and <math> h_{t+1} </math>. <br />
<br />
Traditional RNNs are only able to remember the most recent words in a sequence, which may be problematic since words that came at the beginning of the sequence that is important to the classification problem may be forgotten. This is because the error signal needs to travel backward through the whole pathway in backpropagation. Since the activations in the backward pass are entirely linear, these activation functions will "explode" if its largest eigenvalue is greater than one and "vanish" when its largest eigenvalue is less than one, which describes the problem of vanishing/exploding gradients (Grosse). The pipeline will actually use a variant of RNN called GRU, short for Gated Recurrent Units. This is done to address the vanishing gradient problem which causes the network to struggle to memorize words that came earlier in the sequence. A GRU attempts to solve this by controlling the flow of information through the network using update and reset gates. <br />
<br />
Let <math>h_{t-1} \in \mathbb{R}^m, x_t \in \mathbb{R}^d </math> be the inputs, and let <math>\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h \in \mathbb{R}^{m \times d}, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h \in \mathbb{R}^{m \times m}</math> be trainable weight matrices. Then the following equations describe the update and reset gates:<br />
<br />
<br />
<math><br />
z_t = \sigma(\mathbf{W}_zx_t + \mathbf{U}_zh_{t-1}) \text{update gate} \\<br />
r_t = \sigma(\mathbf{W}_rx_t + \mathbf{U}_rh_{t-1}) \text{reset gate} \\<br />
\tilde{h}_t = \text{tanh}(\mathbf{W}_hx_t + r_t \circ \mathbf{U}_hh_{t-1}) \text{new memory} \\<br />
h_t = (1-z_t)\circ \tilde{h}_t + z_t\circ h_{t-1}<br />
</math><br />
<br />
<br />
Note that <math> \sigma, \text{tanh}, \circ </math> are all element-wise functions. The above equations do the following:<br />
<br />
<ol><br />
<li> <math>h_{t-1}</math> carries information from the previous iteration and <math>x_t</math> is the current input </li><br />
<li> the update gate <math>z_t</math> controls how much past information should be forwarded to the next hidden state </li><br />
<li> the rest gate <math>r_t</math> controls how much past information is forgotten or reset </li><br />
<li> new memory <math>\tilde{h}_t</math> contains the relevant past memory as instructed by <math>r_t</math> and current information from the input <math>x_t</math> </li><br />
<li> then <math>z_t</math> is used to control what is passed on from <math>h_{t-1}</math> and <math>(1-z_t)</math> controls the new memory that is passed on<br />
</ol><br />
<br />
We treat <math>h_0</math> and <math> h_{n+1} </math> as zero vectors in the method. Thus, each <math>h_t</math> can be computed as above to yield results for the bi-directional RNN. The resulting hidden states <math>\overrightarrow{h_t}</math> and <math>\overleftarrow{h_t}</math> contain contextual information around the <math> t</math>-th word in forward and backward directions respectively. Contrary to convention, instead of concatenating these two vectors, it is argued that the word's contextual representation is more precise when the context information from different directions is collected and fused using a neural tensor layer as it permits greater interactions among each element of hidden states. Using these two vectors as input to the neural tensor layer, <math>V^i </math>, we compute a new representation that aggregates meanings from the forward and backward hidden states more accurately as follows:<br />
<br />
<math> <br />
[\hat{h_t}]_i = tanh(\overrightarrow{h_t}V^i\overleftarrow{h_t} + b_i) <br />
</math><br />
<br />
Where <math>V^i \in \mathbb{R}^{m \times m} </math> is the learned tensor layer, and <math> b_i \in \mathbb{R} </math> is the bias.We repeat this <math> m </math> times with different <math>V^i </math> matrices and <math> b_i </math> vectors. Through the neural tensor layer, each element in <math> [\hat{h_t}]_i </math> can be viewed as a different type of intersection between the forward and backward hidden states. In the model, <math> [\hat{h_t}]_i </math> will have the same size as the forward and backward hidden states. We then concatenate the three hidden states vectors to form a new vector that summarizes the context information :<br />
<math><br />
\overleftrightarrow{h_t} = [\overrightarrow{h_t}^T,\overleftarrow{h_t}^T,\hat{h_t}]^T <br />
</math><br />
<br />
We calculate this vector for every word in the text and then stack them all into matrix <math> H </math> with shape <math>n</math>-by-<math>3m</math>.<br />
<br />
<math><br />
H = [\overleftrightarrow{h_1};...\overleftrightarrow{h_n}]<br />
</math><br />
<br />
This <math>H</math> matrix is then forwarded as the results from the Recurrent Neural Network.<br />
<br />
<br />
'''CNN Pipeline:'''<br />
<br />
The goal of the CNN pipeline is to learn the relative importance of words in an input sequence based on different aspects. The process of this CNN pipeline is summarized as the following steps:<br />
<br />
<ol><br />
<li> Given a sequence of words, each word is converted into a word vector using the word2vec algorithm which gives matrix X. <br />
</li><br />
<br />
<li> Word vectors are then convolved through the temporal dimension with filters of various sizes (ie. different K) with learnable weights to capture various numerical K-gram representations. These K-gram representations are stored in matrix C.<br />
</li><br />
<br />
<ul><br />
<li> The convolution makes this process capture local and position-invariant features. Local means the K words are contiguous. Position-invariant means K contiguous words at any position are detected in this case via convolution.<br />
<br />
<li> Temporal dimension example: convolve words from 1 to K, then convolve words 2 to K+1, etc<br />
</li><br />
</ul><br />
<br />
<li> Since not all K-gram representations are equally meaningful, there is a learnable matrix W which takes the linear combination of K-gram representations to more heavily weigh the more important K-gram representations for the classification task.<br />
</li><br />
<br />
<li> Each linear combination of the K-gram representations gives the relative word importance based on the aspect that the linear combination encodes.<br />
</li><br />
<br />
<li> The relative word importance vs aspect gives rise to an interpretable attention matrix A, where each element says the relative importance of a specific word for a specific aspect.<br />
</li><br />
<br />
</ol><br />
<br />
[[File:Group12_Figure1.png |center]]<br />
<br />
<div align="center">Figure 1: The architecture of CRNN.</div><br />
<br />
== Merging RNN & CNN Pipeline Outputs ==<br />
<br />
The results from both the RNN and CNN pipeline can be merged by simply multiplying the output matrices. That is, we compute <math>S=A^TH</math> which has shape <math>z \times 3m</math> and is essentially a linear combination of the hidden states. The concatenated rows of S results in a vector in <math>\mathbb{R}^{3zm}</math> and can be passed to a fully connected Softmax layer to output a vector of probabilities for our classification task. <br />
<br />
To train the model, we make the following decisions:<br />
<ul><br />
<li> Use cross-entropy loss as the loss function (A cross-entropy loss function usually takes in two distributions, a true distribution p and an estimated distribution q, and measures the average number of bits need to identify an event. This calculation is independent of the kind of layers used in the network as well as the kind of activation being implemented.) </li><br />
<li> Perform dropout on random columns in matrix C in the CNN pipeline </li><br />
<li> Perform L2 regularization on all parameters </li><br />
<li> Use stochastic gradient descent with a learning rate of 0.001 </li><br />
</ul><br />
<br />
== Interpreting Learned CRNN Weights ==<br />
<br />
Recall that attention matrix A essentially stores the relative importance of every word in the input sequence for every aspect chosen. Naturally, this means that A is an n-by-z matrix, with n being the number of words in the input sequence and z being the number of aspects considered in the classification task. <br />
<br />
Furthermore, for any specific aspect, words with higher attention values are more important relative to other words in the same input sequence. likewise, for any specific word, aspects with higher attention values prioritize the specific word more than other aspects.<br />
<br />
For example, in this paper, a sentence is sampled from the Movie Reviews dataset, and the transpose of attention matrix A is visualized. Each word represents an element in matrix A, the intensity of red represents the magnitude of an attention value in A, and each sentence is the relative importance of each word for a specific context. In the first row, the words are weighted in terms of a positive aspect, in the last row, the words are weighted in terms of a negative aspect, and in the middle row, the words are weighted in terms of a positive and negative aspect. Notice how the relative importance of words is a function of the aspect.<br />
<br />
[[File:Interpretation example.png|800px|center]]<br />
<br />
From the above sample, it is interesting that the word "but" is viewed as a negative aspect. From a linguistic perspective, the semantic of "but" is incredibly difficult to capture because of the degree of contextual information it needs. In this case, "but" is in the middle of a transition from a negative to a positive so the first row should also have given attention to that word. Also, it seems that the model has learned to give very high attention to the two words directly adjacent to the word of high attention: "is" and "and" beside "powerful", and "an" and "cast" beside "unwieldy".<br />
<br />
The paper also shows that the model can determine important words in the news. The authors take Ag's news dataset and randomly select 10 important words for each class. The table (shown below) contains eye-catching words which fit their classes well.<br />
<br />
[[File:news_words.png|480px|center]]<br />
<br />
== Conclusion & Summary ==<br />
<br />
This paper proposed a new architecture, the Convolutional Recurrent Neural Network, for text classification. The Convolutional Neural Network is used to learn the relative importance of each word from their different aspects and stores this information into a weight matrix. The Recurrent Neural Network learns each word's contextual representation through the combination of the forward and backward context information that is fused using a neural tensor layer and is stored as a matrix. These two matrices are then combined to get the text representation used for classification. Although the specifics of the performed tests are lacking, the experiment's results indicate that the proposed method performed well in comparison to most previous methods. In addition to performing well, the proposed method also provides insight into which words contribute greatly to the classification decision as to the learned matrix from the Convolutional Neural Network stores the relative importance of each word. This information can then be used in other applications or analyses. In the future, one can explore the features extracted from the model and use them to potentially learn new methods such as model space. [5]<br />
<br />
== Critiques ==<br />
<br />
1. In the '''Method''' section of the paper, some explanations used the same notation for multiple different elements of the model. This made the paper harder to follow and understand since they were referring to different elements by identical notation. Additionally, the decision to use sigmoid and hyperbolic tangent functions as nonlinearities for representation learning is not supported with evidence that these are optimal.<br />
<br />
2. In '''Comparison of Methods''', the authors discuss the range of hyperparameter settings that they search through. While some of the hyperparameters have a large range of search values, three parameters are fixed without much explanation as to why for all experiments, size of the hidden state of GRU, number of layers, and dropout. These parameters have a lot to do with the complexity of the model and this paper could be improved by providing relevant reasoning behind these values, or by providing additional experimental results over different values of these parameters.<br />
<br />
3. In the '''Results''' section of the paper, they tried to show that the classification results from the CRNN model can be better interpreted than other models. In these explanations, the details were lacking and the authors did not adequately demonstrate how their model is better than others.<br />
<br />
4. Finally, in the '''Results''' section again, the paper compares the CRNN model to several models which they did not implement and reproduce results with. This can be seen in the chart of results above, where several models do not have entries in the table for all six datasets. Since the authors used a subset of the datasets, these other models which were not reproduced could have different accuracy scores if they had been tested on the same data as the CRNN model. This difference in training and testing data is not mentioned in the paper, and the conclusion that the CRNN model is better in all cases may not be valid.<br />
<br />
5. Considering the general methodology, the author in the paper chose to fuse CNN with Gated recurrent unit (GRU), which is only one version of RNN. However, it has been shown that LSTM generally performs better than GRU, and the author should discuss their choice of using GRU in CRNN instead of LSTM in more detail. Looking at the authors' experimental results, LSTM even outperforms CRNN (with GRU) in some cases, which further motivates the idea of adopting LSTM for RNN to combine with CNN. Whether combining LSTM with CNN will lead to a better performance will of course need further verification, but in principle author should at least address the issue. <br />
<br />
6. This is an interesting method, I would be curious to see if this can be combined or compared with Quasi-Recurrent Neural Networks (https://arxiv.org/abs/1611.01576). In my experience, QRNNs perform similarly to LSTMs while running significantly faster using convolutions with a special temporal pooling. This seems compatible with the neural tensor layer proposed in this paper, which may be combined to yield stronger performance with faster runtimes.<br />
<br />
7. It would be interesting to see how the attention matrix is being constructed and how attention values are being determined in each matrix. For instance, does every different subject have its own attention matrix? If so, how will the situation be handled when the same attention matrix is used in different settings?<br />
<br />
8. The paper shows the CRNN model not performing the best with Ag's news and 20newsgroups. It would be interesting to investigate this in detail and see the difference in the way the data is handled in the model compared to the best performing model(self-attentive LSTM in both datasets).<br />
<br />
9. From the Interpreting Learned CRNN Weights part, the samples are labeled as positive and negative, and their words all have opposite emotional polarities. It can be observed that regardless of whether the polarity of the example is positive or negative, the keyword can be extracted by this method, reflecting that it can capture multiple semantically meaningful components. At the same time, it will be very interesting to see if this method is applicable to other specific categories.<br />
<br />
10. The authors of this paper provide 2 examples of what topic classification is, but do not provide any explicit examples of "polysemic words whose meanings are context-sensitive", one of their main critiques of current methods. This is an opportunity to promote the use of their method and engage and inform the reader, simply by listing examples of these words.<br />
<br />
11. In the "Interpreting Learned CRNN Weights" parts, authors gave a figure but did not explain whether this is an example of positive case or negative case. I suggest the author to label the figure. Also, in the paper, we can see there are 2 figures and both of the figures have been labelled as positive or negative but there is only one figure without labelled in the summary.<br />
<br />
12. In the CRNN Results vs Benchmarks, it would be better to provide more details and comparison of other methods other than CRNN. It would be clear and more intuitive for the readers why the author will choose CRNN rather than the others.<br />
<br />
13. The author should explore more on why labeling from certain data source is much more accurate than other data source.<br />
<br />
14. Nice visual illustration for explaining the whole process. More explanations can be done on how two neural networks are combined. Would the result be different if two neural networks were modelled sequentially?<br />
<br />
15. The author could provide more information on the training process. For instance, with the given configuration and learning rate, how long did it take for the model's training accuracy to converge. Also, it would be interesting to see the effects of using other types of units like LSTM.<br />
<br />
16. It would be more understandable if the authors can provide more details (maybe some visual results) of how to merge the 2 algorithms together!<br />
<br />
== Comments ==<br />
<br />
- Could this be applied to ancient languages such as hieroglyphs to decipher/better understand them?<br />
<br />
- Another application for CRNN might be classifying spoken language as well as any form of audio.<br />
<br />
-I think it will be better to show more results by using this method. Maybe it will be better to put the result part after the architecture part? Writing a motivation will be better since it will catch readers' "eyes". I think it will be interesting to ask: whether can we apply this to ancient Chinese poetry? Since there are lots of types of ancient Chinese poetry, doing a classification for them will be interesting.<br />
<br />
- In another [https://www.aclweb.org/anthology/W99-0908/ paper] written by Andrew McCallum and Kamal Nigam, they introduce a different method of text classification. Namely, instead of a combination of recurrent and convolutional neural networks, they instead utilized bootstrapping with keywords, Expectation-Maximization algorithm, and shrinkage.<br />
<br />
- The author could add more comparison of text classification with other algorithms that have the same functionalities such as compare CRNN with clustering of text to do classification.<br />
<br />
== References ==<br />
----<br />
<br />
[1] Grimes, Seth. “Unstructured Data and the 80 Percent Rule.” Breakthrough Analysis, 1 Aug. 2008, breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.<br />
<br />
[2] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modeling sentences,”<br />
arXiv preprint arXiv:1404.2188, 2014.<br />
<br />
[3] K. Cho, B. V. Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning<br />
phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint<br />
arXiv:1406.1078, 2014.<br />
<br />
[4] Keren, G., &amp; Schuller, B. (2016, February 18). Convolutional RNN: An Enhanced Model for Extracting Features from Sequential Data. Retrieved November 30, 2020, from https://arxiv.org/pdf/1602.05875.pdf.<br />
<br />
[5] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Proceedings<br />
of AAAI, 2015, pp. 2267–2273.<br />
<br />
[6] H. Chen, P. Tio, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE<br />
Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 124–136, 2014.<br />
<br />
[7] Grosse, R. Lecture 15: Exploding and Vanishing Gradients. Lecture presented at CSC 321 in University of Toronto, Toronto. Retrieved from http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/readings/L15 Exploding and Vanishing Gradients.pdf&usg=AOvVaw0tbIdxkUR1WsUtyIf4HVSn</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=49666Summary for survey of neural networked-based cancer prediction models from microarray data2020-12-07T03:08:49Z<p>X277lu: /* Neural network-based cancer prediction models */</p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology can help researchers quickly detect genetic information and is widely used to analyze genetic diseases. Researchers use this technology to compare normal and abnormal cancerous tissues to gain insights into the pathology of cancer. <br />
<br />
However, due to the high dimensionality of the gene expressions, the accuracy, and computation time of the model might be affected. The following two approaches are adopted to cope with this problem: the feature selection method and the feature creating method. The formal one, similar to the well-known principal component analysis method, aims to focus on key features and ignore minor noises. On the other hand, the latter one, similar to scale-invariant feature transformations, aims to combine existing features or map them to a new low-dimensional space.<br />
<br />
Compared to neural network models presented by others, this paper is specifically designed for predicting cancer using gene expression data. Thus, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions.<br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function such as sigmoid or rectified linear activation functions. To train the network, the inputs are fed forward and the activation function value is calculated at every neuron. The difference between the output of the neural network and the actual value is what we call an error.<br />
The backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and a different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
Cancer prediction models often contain more than 1 method to achieve high prediction accuracy with a more accurate prognosis and it also aims to reduce the cost of patients.<br />
<br />
High dimensionality and spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
<br />
The first way is called the ROC curve. ROC curves, receiver operating characteristic curves, are graphs that show the true positive rate against the false-positive rate [4]. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is considered acceptable when its ROC is greater than 0.7. <br />
<br />
A different machine learning problem is predicting the survival time. The performance of a model that predicts survival time can be measured using two metrics. The problem of predicting survival time can be seen as a ranking problem, where survival times of different subjects are ranked against each other. CI (Concordance Index) is a measure of how good a model ranks survival times and explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. We can express the ordering of survival times in an order graph <math display="inline">G = (V, E)</math> where the vertices <math display="inline">V</math> are the individual survival times, and the edges <math display="inline">E_{ij}</math> from individual <math display="inline">i</math> to <math display="inline">j</math> indicate that <math display="inline">T_i < T_j</math>, where <math display="inline">T_i, T_j</math> are the survival times for individuals <math display="inline">i,j</math> respectively. Then we can write <math display="inline">CI = \frac{1}{|E|}\sum_{i,j}I(f(i) < f(j))</math> where <math display="inline">|E|</math> is the number of edges in the order graph (ie. the total number of comparable pairs), and <math display="inline">I(f(i) < f(j)) = 1</math> if the predictor <math display="inline">f</math> correctly ranks <math display="inline">T_i < T_j</math> [3].<br />
<br />
Another metric is the Brier score, which measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy. It is defined as <math display="inline">\frac{1}{n}\sum_{i=1}^n(f_i - o_i)^2</math> where <math display="inline">f_i</math> is the predicted survival rate, and <math display="inline">o_i</math> is the observed survival rate [2].<br />
<br />
== Neural network-based cancer prediction models ==<br />
An extensive search relevant to neural network-based cancer prediction was performed using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, and only articles between 2013 and 2018 with available accessibility were considered. The chosen papers covered cancer classification, discovery, survivability prediction, and statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive, and clustering for chosen papers. <br />
<br />
[[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p-value > <math>10^{-5}</math> to remove some unwanted technical variation and <math>\log_2</math> transformations. Statistical methods, neural networks, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the dataset's features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcome the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to a sparse matrix problem. This is because variability needed to be included in the generated samples. Consequently, randomly chosen features would be set to zero, which may cause the sparse matrix problem (Daoud & Mayo, 2019). Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups, and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that neural network methods are used for filtering, predicting, and clustering in cancer prediction. <br />
<br />
*''filtering'': Filter the gene expressions to eliminate noise or reduce dimensionality. Then use the resulted features with statistical methods or machine learning classification and clustering tools as figure 2 indicates.<br />
<br />
*''predicting'': Extract features and improve the accuracy of prediction (classification).<br />
<br />
*''clustering'': Divide the gene expressions or samples based on similarity.<br />
<br />
[[File:filtering gane.png]]<br />
<br />
All the neurons in the neural network work together as feature detectors to learn the features from the input. In order to categorize a neural network as filtering, predicting, or clustering method, we looked at the overall role that network provided within the framework of cancer prediction. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M that is smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithms such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy; meanwhile, clustering methods are unsupervised, which group similar samples or genes together. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group for a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering, and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
<br />
$$ RMSE = \sqrt{ \frac{\sum{(I(x)-O(x))^2}}{n} } $$<br />
<br />
$$ Logloss = \sum{(I(x)\log(O(x)) + (1 - I(x))\log(1 - O(x)))} $$<br />
<br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders, and variational autoencoders. The architecture of the networks varies in many parameters, such as depth and loss function. Each example of an autoencoder mentioned above has a different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient descent optimization, Adam optimizer).<br />
<br />
Overfitting is a major problem that most autoencoders need to deal with to achieve high efficiency of the extracted features. Regularization, dropout, and sparsity are common solutions.<br />
<br />
The neural network filtering methods were used by different statistical methods and classifiers. The conventional methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoost or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
<br />
The codeword is a binary string <math>C'k</math> of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to map the input to the codeword iteratively, whose objective function is minimized in each iteration.<br />
<br />
Such cancer classifiers were applied to identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as a classifier is better than those of SVM, logistic regression, naïve Bayes, classification trees, and KNN.<br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without a backpropagation mechanism, is one of the traditional model-based techniques to be applied to gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment or both. The high dimensionality of gene expression samples poses a problem for traditional clustering algorithms such as k-means clustering, which uses a distance function to separate samples. Such an approach is not viable for high dimensional datasets. To solve the high dimensionality problem, there are two methods: clustering ensembles by running a single clustering algorithm several times, each of which has different initialization or number of parameters; and projective clustering by only considering a subset of the original features.<br />
<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were not easy to be obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken the gene-to-cluster assignment into consideration.<br />
<br />
Also, the paper provides a double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.<br />
<br />
== Summary ==<br />
<br />
Cancer is a disease with a high mortality rate that kills millions of people every year, and it’s essential to analyze gene expression for discovering gene abnormalities and increasing survivability as a consequence. The previous analysis in the paper reveals that neural networks are essentially used for filtering the gene expressions, predicting their class, or clustering them.<br />
<br />
Neural network filtering methods are used to reduce the dimensionality of the gene expressions and remove their noise. In the article, the authors recommended deep architectures in comparison to shallow architectures for best practice, as they combine many nonlinearities. <br />
<br />
Neural network prediction methods can be used for both binary and multi-class problems. In the binary case, the network architecture has only one or two output neurons that diagnose a given sample as cancerous or non-cancerous. In comparison, the number of the output neurons in multi-class problems is equal to the number of classes. The authors suggested that the deep architecture with convolution layers which was the most recently used model proved efficient capability in predicting cancer subtypes, as it captures the spatial correlations between gene expressions.<br />
Clustering is another analysis tool that is used to divide the gene expressions into groups. The authors indicated that a hybrid approach combining both the assembling, clustering and projective clustering is more accurate than using single-point clustering algorithms, such as SOM, since those methods do not have the capability to distinguish the noisy genes.<br />
<br />
==Discussion==<br />
There are some technical problems that can be considered and improved for building new models. <br><br />
<br />
1. Overfitting: Since gene expression datasets are high dimensional and have a relatively small number of samples, it would be likely to properly fits the training data but not accurate for test samples due to the lack of generalization capability. The ways to avoid overfitting can be: (1). adding weight penalties using regularization; (2). using the average predictions from many models trained on different datasets; (3). dropout. (4) Augmentation of the dataset to produce more "observations".<br><br />
<br />
2. Model configuration and training: In order to reduce both the computational and memory expenses but also with high prediction accuracy, it’s crucial to properly set the network parameters. The possible ways can be: (1). proper initialization; (2). pruning the unimportant connections by removing the zero-valued neurons; (3). using ensemble learning framework by training different models using different parameter settings or using different parts of the dataset for each base model; (4). Using SMOTE for dealing with class imbalance on the high dimensional level. <br><br />
<br />
3. Model evaluation: In Braga-Neto and Dougherty's research, they have investigated several model evaluation methods: cross-validation, substitution and bootstrap methods. The cross-validation was found to be unreliable for small size data since it displayed excessive variance. The bootstrap method proved more accurate predictability.<br><br />
<br />
4. Study producibility: A study needs to be reproducible to enhance research reliability so that others can replicate the results using the same algorithms, data and methodology. Hence, the query used for getting the dataset should be stated.<br />
<br />
==Conclusion==<br />
This paper reviewed the most recent neural network-based cancer prediction models and gene expression analysis tools. The analysis indicates that the neural network methods are able to serve as filters, predictors, and clustering methods, and also showed that the role of the neural network determines its general architecture. The authors showed that Neural Network filtering methods are a way of reducing the dimensionality of the gene expressions, as well as removing their noise for better model fitting. To give suggestions for future neural network-based approaches, the authors highlighted some critical points that have to be considered such as overfitting and class imbalance, and suggest choosing different network parameters or combining two or more of the presented approaches. One of the biggest challenges for cancer prediction modelers is deciding on the network architecture (i.e. the number of hidden layers and neurons), as there are currently no guidelines to follow to obtain high prediction accuracy. The authors discovered that there is no algorithm available to concretely determine an optimal number of hidden layers or nodes and found that many papers simply implemented a trial and error method to reduce loss in the model.<br />
<br />
==Critiques==<br />
<br />
While results indicate that the functionality of the neural network determines its general architecture, the decision on the number of hidden layers, neurons, hypermeters, and learning algorithms is made using trial-and-error techniques. Therefore improvements in this area of the model might need to be explored in order to obtain better results and in order to make more convincing statements.<br />
<br />
An issue that one must be mindful of is the underlying distribution of data. Cancer is an extremely complex genetic disease and the predictions would depend on so many variables, a number of which will not even be present in the dataset as they might have been collected. So there is a need for extensive validation when it comes to applying deep learning methods to cancer-related data.<br />
<br />
In the field of medical sciences and molecular biology, interpretability of results is imperative as often experts seek not just to solve the issue at hand but to understand the causal relationships. Having a high ROC value may not necessarily convince other experts on the validity of the finding because the underlying details of cancer symptoms have been abstracted in a complex neural network as a black box. However, the neural network clustering method suggested in this paper does offer a good compromise because it enables humans to visual low-level features but still gives experts the control on making various predictions using well-studied traditional techniques.<br />
<br />
With high dimensionality features, kernel SVM is another option for cancer prediction. Jiang et. al. developed a Hadamard Kernel for predicting breast cancer using gene expression data, and it utilizes the Kernel trick to avoid high computational efforts (link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5763304/). Compared against linear, quadratic, RBF and correlation kernels, Hadamard Kernel performs best with the highest averaged area under the ROC curve (AUC) value. It may be interesting to compare the performance and accuracy between the Hadamard Kernel and cancer prediction models with various number of hidden layers and neurons.<br />
<br />
Although the authors presented technical details about data processing, training approaches evaluation metric, and addressed many practical issues that can be considered for cancer prediction, no novel methods or models are proposed. It's more like a proof-of-concept about the feasibility of different models on cancer prediction.<br />
<br />
It would be interesting for the researchers to compare the performance between causal inference and neural network models on this data.<br />
<br />
As the authors indicate neural networks would be a useful tool for cancer prediction models, the article is lacking an example for implementing neural networks to provide persuasive support for their arguments.<br />
<br />
The inheritance of cancer is complex and changeable. The predicted variables are therefore very complicated, so for the model of the learning data set, a more adequate training set is needed to learn. And multi-party verification of the learned model.<br />
<br />
The authors mentioned many different neural network models and compared them. It would be better if more details of a commonly applied model with relatively high accuracy could be given such as how the model is built. An article named Convolutional neural network models for cancer type prediction based on gene expression gives explanations of CNN in detail.<br />
<br />
The authors briefly discussed methods and algorithms being used in the presented paper in their summary. However, a very little amount of technical details were provided to the readers. The summary itself is lacking specific examples for the aforementioned algorithms, and datasets which were used in the original analysis were only introduced in one or two sentences. As a result, the summary and conclusion appear to be unconvincing to the readers.<br />
<br />
PCA can still be used as an initial preprocessing step even if it is used in a neural network whose data dimension is reduced. By merging the PCA component with a random number of original functions, some good techniques can be adopted to enable the network to capture more useful relationships.<br />
Correlation based feature selection (CFS) might be also applied in the data pre-processing step to reduce the dimensionality.<br />
<br />
The key part of this model is to extract the features from the model. However, cancer may depends on many explanatory variables. Thus how do we know which feature should we extract in the data preprocessing. Since there are correlations between each variables. Authors did not specify this situation.<br />
<br />
From the biology side of view, gene expression is really complicated. Thus reducing dimension may or may not be the best way of predicting cancer, and this should be a controversial topic.<br />
<br />
The author should explore more on the reason predicting certain cancers are more accurate than other cancers. Also the medical conditions are different among all individuals, there are more features that need to be considered when doing this model.<br />
<br />
==Reference==<br />
[1] Daoud, M., & Mayo, M. (2019). A survey of neural network-based cancer prediction models from microarray data. Artificial Intelligence in Medicine, 97, 204–214.<br />
<br />
[2] Brier GW. 1950. Verification of forecasts expressed in terms of probabilities. Monthly Weather Review 78: 1–3<br />
<br />
[3] Harald Steck, Balaji Krishnapuram, Cary Dehing-oberije, Philippe Lambin, Vikas C. Raykar. On ranking in survival analysis: Bounds on the concordance index. In Advances in Neural Information Processing Systems (2008), pp. 1209-1216<br />
<br />
[4] Google Developers. (2020 February 10). ''Classification: ROC Curve and AUC.'' Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc<br />
<br />
[5] Mostavi, M., Chiu, YC., Huang, Y. et al. Convolutional neural network models for cancer type prediction based on gene expression. ''BMC Med Genomics'' '''13''', 44 (2020). https://doi.org/10.1186/s12920-020-0677-2</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=49664Summary for survey of neural networked-based cancer prediction models from microarray data2020-12-07T03:07:08Z<p>X277lu: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology can help researchers quickly detect genetic information and is widely used to analyze genetic diseases. Researchers use this technology to compare normal and abnormal cancerous tissues to gain insights into the pathology of cancer. <br />
<br />
However, due to the high dimensionality of the gene expressions, the accuracy, and computation time of the model might be affected. The following two approaches are adopted to cope with this problem: the feature selection method and the feature creating method. The formal one, similar to the well-known principal component analysis method, aims to focus on key features and ignore minor noises. On the other hand, the latter one, similar to scale-invariant feature transformations, aims to combine existing features or map them to a new low-dimensional space.<br />
<br />
Compared to neural network models presented by others, this paper is specifically designed for predicting cancer using gene expression data. Thus, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions.<br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function such as sigmoid or rectified linear activation functions. To train the network, the inputs are fed forward and the activation function value is calculated at every neuron. The difference between the output of the neural network and the actual value is what we call an error.<br />
The backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and a different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
Cancer prediction models often contain more than 1 method to achieve high prediction accuracy with a more accurate prognosis and it also aims to reduce the cost of patients.<br />
<br />
High dimensionality and spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
<br />
The first way is called the ROC curve. ROC curves, receiver operating characteristic curves, are graphs that show the true positive rate against the false-positive rate [4]. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is considered acceptable when its ROC is greater than 0.7. <br />
<br />
A different machine learning problem is predicting the survival time. The performance of a model that predicts survival time can be measured using two metrics. The problem of predicting survival time can be seen as a ranking problem, where survival times of different subjects are ranked against each other. CI (Concordance Index) is a measure of how good a model ranks survival times and explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. We can express the ordering of survival times in an order graph <math display="inline">G = (V, E)</math> where the vertices <math display="inline">V</math> are the individual survival times, and the edges <math display="inline">E_{ij}</math> from individual <math display="inline">i</math> to <math display="inline">j</math> indicate that <math display="inline">T_i < T_j</math>, where <math display="inline">T_i, T_j</math> are the survival times for individuals <math display="inline">i,j</math> respectively. Then we can write <math display="inline">CI = \frac{1}{|E|}\sum_{i,j}I(f(i) < f(j))</math> where <math display="inline">|E|</math> is the number of edges in the order graph (ie. the total number of comparable pairs), and <math display="inline">I(f(i) < f(j)) = 1</math> if the predictor <math display="inline">f</math> correctly ranks <math display="inline">T_i < T_j</math> [3].<br />
<br />
Another metric is the Brier score, which measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy. It is defined as <math display="inline">\frac{1}{n}\sum_{i=1}^n(f_i - o_i)^2</math> where <math display="inline">f_i</math> is the predicted survival rate, and <math display="inline">o_i</math> is the observed survival rate [2].<br />
<br />
== Neural network-based cancer prediction models ==<br />
An extensive search relevant to neural network-based cancer prediction was performed using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, and only articles between 2013 and 2018 with available accessibility were considered. The chosen papers covered cancer classification, discovery, survivability prediction, and statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive, and clustering for chosen papers. <br />
<br />
[[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p-value > <math>10^{-5}</math> to remove some unwanted technical variation and <math>\log_2</math> transformations. Statistical methods, neural networks, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the dataset's features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcome the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to a sparse matrix problem. This is because variability needed to be included in the generated samples. Consequently, randomly chosen features would be set to zero, which may cause the sparse matrix problem (Daoud & Mayo, 2019). Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups, and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that neural network methods are used for filtering, predicting, and clustering in cancer prediction. <br />
<br />
*''filtering'': Filter the gene expressions to eliminate noise or reduce dimensionality. Then use the resulted features with statistical methods or machine learning classification and clustering tools as figure 2 indicates.<br />
<br />
*''predicting'': Extract features and improve the accuracy of prediction (classification).<br />
<br />
*''clustering'': Divide the gene expressions or samples based on similarity.<br />
<br />
[[File:filtering gane.png]]<br />
<br />
All the neurons in the neural network work together as feature detectors to learn the features from the input. In order to categorize a neural network as filtering, predicting, or clustering method, we looked at the overall role that network provided within the framework of cancer prediction. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithms such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy; meanwhile, clustering methods are unsupervised, which group similar samples or genes together. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group for a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering, and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
<br />
$$ RMSE = \sqrt{ \frac{\sum{(I(x)-O(x))^2}}{n} } $$<br />
<br />
$$ Logloss = \sum{(I(x)\log(O(x)) + (1 - I(x))\log(1 - O(x)))} $$<br />
<br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders, and variational autoencoders. The architecture of the networks varies in many parameters, such as depth and loss function. Each example of an autoencoder mentioned above has a different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient descent optimization, Adam optimizer).<br />
<br />
Overfitting is a major problem that most autoencoders need to deal with to achieve high efficiency of the extracted features. Regularization, dropout, and sparsity are common solutions.<br />
<br />
The neural network filtering methods were used by different statistical methods and classifiers. The conventional methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoost or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
<br />
The codeword is a binary string <math>C'k</math> of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to map the input to the codeword iteratively, whose objective function is minimized in each iteration.<br />
<br />
Such cancer classifiers were applied to identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as a classifier is better than those of SVM, logistic regression, naïve Bayes, classification trees, and KNN.<br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without a backpropagation mechanism, is one of the traditional model-based techniques to be applied to gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment or both. The high dimensionality of gene expression samples poses a problem for traditional clustering algorithms such as k-means clustering, which uses a distance function to separate samples. Such an approach is not viable for high dimensional datasets. To solve the high dimensionality problem, there are two methods: clustering ensembles by running a single clustering algorithm several times, each of which has different initialization or number of parameters; and projective clustering by only considering a subset of the original features.<br />
<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were not easy to be obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken the gene-to-cluster assignment into consideration.<br />
<br />
Also, the paper provides a double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.<br />
<br />
== Summary ==<br />
<br />
Cancer is a disease with a high mortality rate that kills millions of people every year, and it’s essential to analyze gene expression for discovering gene abnormalities and increasing survivability as a consequence. The previous analysis in the paper reveals that neural networks are essentially used for filtering the gene expressions, predicting their class, or clustering them.<br />
<br />
Neural network filtering methods are used to reduce the dimensionality of the gene expressions and remove their noise. In the article, the authors recommended deep architectures in comparison to shallow architectures for best practice, as they combine many nonlinearities. <br />
<br />
Neural network prediction methods can be used for both binary and multi-class problems. In the binary case, the network architecture has only one or two output neurons that diagnose a given sample as cancerous or non-cancerous. In comparison, the number of the output neurons in multi-class problems is equal to the number of classes. The authors suggested that the deep architecture with convolution layers which was the most recently used model proved efficient capability in predicting cancer subtypes, as it captures the spatial correlations between gene expressions.<br />
Clustering is another analysis tool that is used to divide the gene expressions into groups. The authors indicated that a hybrid approach combining both the assembling, clustering and projective clustering is more accurate than using single-point clustering algorithms, such as SOM, since those methods do not have the capability to distinguish the noisy genes.<br />
<br />
==Discussion==<br />
There are some technical problems that can be considered and improved for building new models. <br><br />
<br />
1. Overfitting: Since gene expression datasets are high dimensional and have a relatively small number of samples, it would be likely to properly fits the training data but not accurate for test samples due to the lack of generalization capability. The ways to avoid overfitting can be: (1). adding weight penalties using regularization; (2). using the average predictions from many models trained on different datasets; (3). dropout. (4) Augmentation of the dataset to produce more "observations".<br><br />
<br />
2. Model configuration and training: In order to reduce both the computational and memory expenses but also with high prediction accuracy, it’s crucial to properly set the network parameters. The possible ways can be: (1). proper initialization; (2). pruning the unimportant connections by removing the zero-valued neurons; (3). using ensemble learning framework by training different models using different parameter settings or using different parts of the dataset for each base model; (4). Using SMOTE for dealing with class imbalance on the high dimensional level. <br><br />
<br />
3. Model evaluation: In Braga-Neto and Dougherty's research, they have investigated several model evaluation methods: cross-validation, substitution and bootstrap methods. The cross-validation was found to be unreliable for small size data since it displayed excessive variance. The bootstrap method proved more accurate predictability.<br><br />
<br />
4. Study producibility: A study needs to be reproducible to enhance research reliability so that others can replicate the results using the same algorithms, data and methodology. Hence, the query used for getting the dataset should be stated.<br />
<br />
==Conclusion==<br />
This paper reviewed the most recent neural network-based cancer prediction models and gene expression analysis tools. The analysis indicates that the neural network methods are able to serve as filters, predictors, and clustering methods, and also showed that the role of the neural network determines its general architecture. The authors showed that Neural Network filtering methods are a way of reducing the dimensionality of the gene expressions, as well as removing their noise for better model fitting. To give suggestions for future neural network-based approaches, the authors highlighted some critical points that have to be considered such as overfitting and class imbalance, and suggest choosing different network parameters or combining two or more of the presented approaches. One of the biggest challenges for cancer prediction modelers is deciding on the network architecture (i.e. the number of hidden layers and neurons), as there are currently no guidelines to follow to obtain high prediction accuracy. The authors discovered that there is no algorithm available to concretely determine an optimal number of hidden layers or nodes and found that many papers simply implemented a trial and error method to reduce loss in the model.<br />
<br />
==Critiques==<br />
<br />
While results indicate that the functionality of the neural network determines its general architecture, the decision on the number of hidden layers, neurons, hypermeters, and learning algorithms is made using trial-and-error techniques. Therefore improvements in this area of the model might need to be explored in order to obtain better results and in order to make more convincing statements.<br />
<br />
An issue that one must be mindful of is the underlying distribution of data. Cancer is an extremely complex genetic disease and the predictions would depend on so many variables, a number of which will not even be present in the dataset as they might have been collected. So there is a need for extensive validation when it comes to applying deep learning methods to cancer-related data.<br />
<br />
In the field of medical sciences and molecular biology, interpretability of results is imperative as often experts seek not just to solve the issue at hand but to understand the causal relationships. Having a high ROC value may not necessarily convince other experts on the validity of the finding because the underlying details of cancer symptoms have been abstracted in a complex neural network as a black box. However, the neural network clustering method suggested in this paper does offer a good compromise because it enables humans to visual low-level features but still gives experts the control on making various predictions using well-studied traditional techniques.<br />
<br />
With high dimensionality features, kernel SVM is another option for cancer prediction. Jiang et. al. developed a Hadamard Kernel for predicting breast cancer using gene expression data, and it utilizes the Kernel trick to avoid high computational efforts (link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5763304/). Compared against linear, quadratic, RBF and correlation kernels, Hadamard Kernel performs best with the highest averaged area under the ROC curve (AUC) value. It may be interesting to compare the performance and accuracy between the Hadamard Kernel and cancer prediction models with various number of hidden layers and neurons.<br />
<br />
Although the authors presented technical details about data processing, training approaches evaluation metric, and addressed many practical issues that can be considered for cancer prediction, no novel methods or models are proposed. It's more like a proof-of-concept about the feasibility of different models on cancer prediction.<br />
<br />
It would be interesting for the researchers to compare the performance between causal inference and neural network models on this data.<br />
<br />
As the authors indicate neural networks would be a useful tool for cancer prediction models, the article is lacking an example for implementing neural networks to provide persuasive support for their arguments.<br />
<br />
The inheritance of cancer is complex and changeable. The predicted variables are therefore very complicated, so for the model of the learning data set, a more adequate training set is needed to learn. And multi-party verification of the learned model.<br />
<br />
The authors mentioned many different neural network models and compared them. It would be better if more details of a commonly applied model with relatively high accuracy could be given such as how the model is built. An article named Convolutional neural network models for cancer type prediction based on gene expression gives explanations of CNN in detail.<br />
<br />
The authors briefly discussed methods and algorithms being used in the presented paper in their summary. However, a very little amount of technical details were provided to the readers. The summary itself is lacking specific examples for the aforementioned algorithms, and datasets which were used in the original analysis were only introduced in one or two sentences. As a result, the summary and conclusion appear to be unconvincing to the readers.<br />
<br />
PCA can still be used as an initial preprocessing step even if it is used in a neural network whose data dimension is reduced. By merging the PCA component with a random number of original functions, some good techniques can be adopted to enable the network to capture more useful relationships.<br />
Correlation based feature selection (CFS) might be also applied in the data pre-processing step to reduce the dimensionality.<br />
<br />
The key part of this model is to extract the features from the model. However, cancer may depends on many explanatory variables. Thus how do we know which feature should we extract in the data preprocessing. Since there are correlations between each variables. Authors did not specify this situation.<br />
<br />
From the biology side of view, gene expression is really complicated. Thus reducing dimension may or may not be the best way of predicting cancer, and this should be a controversial topic.<br />
<br />
The author should explore more on the reason predicting certain cancers are more accurate than other cancers. Also the medical conditions are different among all individuals, there are more features that need to be considered when doing this model.<br />
<br />
==Reference==<br />
[1] Daoud, M., & Mayo, M. (2019). A survey of neural network-based cancer prediction models from microarray data. Artificial Intelligence in Medicine, 97, 204–214.<br />
<br />
[2] Brier GW. 1950. Verification of forecasts expressed in terms of probabilities. Monthly Weather Review 78: 1–3<br />
<br />
[3] Harald Steck, Balaji Krishnapuram, Cary Dehing-oberije, Philippe Lambin, Vikas C. Raykar. On ranking in survival analysis: Bounds on the concordance index. In Advances in Neural Information Processing Systems (2008), pp. 1209-1216<br />
<br />
[4] Google Developers. (2020 February 10). ''Classification: ROC Curve and AUC.'' Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc<br />
<br />
[5] Mostavi, M., Chiu, YC., Huang, Y. et al. Convolutional neural network models for cancer type prediction based on gene expression. ''BMC Med Genomics'' '''13''', 44 (2020). https://doi.org/10.1186/s12920-020-0677-2</div>X277luhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=what_game_are_we_playing&diff=49662what game are we playing2020-12-07T03:03:19Z<p>X277lu: /* Critiques */</p>
<hr />
<div>== Authors == <br />
Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran <br />
<br />
== Introduction ==<br />
Recently, there have been many different studies of methods using AI to solve the optimal solution for large-scale, zero-sum, extensive form problems. However, most of these works operate under the assumption that the parameters of the game are known, while this scenario is ideal. In real-world problems, most of the parameters are unknown prior to the game. This paper proposes a framework for finding an optimal solution using a primal-dual Newton Method and then using back-propagation to analytically compute the gradients of all the relevant game parameters.<br />
<br />
The approach to solving this problem is to consider ''quantal response equilibrium'' (QRE), which is a generalization of Nash equilibrium (NE) where the agents can make suboptimal decisions. It is shown that the solution to the QRE is a differentiable function of the payoff matrix. Consequently, back-propagation can be used to analytically solve for the payoff matrix (or other game parameters). This strategy has many future application areas as it allows for game-solving (both extensive and normal form) to be integrated as a module in a deep neural network.<br />
<br />
An example of architecture is presented below:<br />
<br />
[[File:Framework.png ]]<br />
<br />
Payoff matrix <math> P </math> is parameterized by a domain-dependent low dimensional vector <math> \phi </math>, where <math> \phi </math> depends on a differentiable function <math> M_1(x) </math>. Furthermore, <math> P </math> is applied to QRE to get the equilibrium strategies <math> (u^∗, v^∗) </math>. Lastly, the loss function is calculated after applying any differentiable <math> M_2(u^∗, v^∗) </math>.<br />
<br />
The effectiveness of this model is demonstrated using the games “Rock, Paper, Scissors”, one-card poker, and a security defense game. This could represent a substantial step forward in understanding how game-theoretic methods can be applied to uncertain settings by the ability to learn solely from observed actions and the relevant parameters.<br />
<br />
== Learning and Quantal Response in Normal Form Games ==<br />
<br />
The game-solving module provides all elements required in differentiable learning, which maps contextual features to payoff matrices, and computes equilibrium strategies under a set of contextual features. This paper will learn zero-sum games and start with normal form games since they have game solver and learning approach capturing much of intuition and basic methodology.<br />
<br />
=== Zero-Sum Normal Form Games ===<br />
<br />
A normal form game is defined as a game with a finite number of players, <math>N</math>, where each player has a non-empty and finite action space, <math>A_i</math>. The outcome of the game is defined by the action profile <math>a = (a_1,...,a_n)</math>, where <math>a_i</math> is the action taken by the <math>i^{th}</math> player. In addition, each player has an associated utility function that is given by <math>u_i: A_1\times...\times A_n \rightarrow\mathbb{R}</math>. An example of a normal form game is the prisoner's dilemma. Zero-sum games refer to a game in which one player gains at the detriment of the other player. Mathematically, <math>\forall a\in A_1\times A_2, u_2(a) + u_2(a) = 0</math> (Larson, "Game-theoretic methods for computer science Normal Form Games"). <br />
<br />
In two-player zero-sum games there is a '''payoff matrix''' <math>P</math> that describes the rewards for two players employing specific strategies u and v respectively. The optimal strategy mixture may be found with a classic min-max formulation:<br />
$$\min_u \max_v \ u^T P v \\ subject \ to \ 1^T u =1, u \ge 0 \\ 1^T v =1, v \ge 0. \ $$<br />
<br />
Here, we consider the case where <math>P</math> is not known a priori. <math>P</math> could be a single fixed but unknown payoff<br />
matrix, or depend on some external context <math>x</math>. The solution <math> (u^*, v_0) </math> to this optimization and the solution <math> (u_0,v^*) </math> to the corresponding problem with inverse player order form the Nash equilibrium <math>(u^*,v^*) </math>. At this equilibrium, the players do not have anything to gain by changing their strategy, so this point is a stable state of the system. When the payoff matrix P is not known, we observe samples of actions <math> a^{(i)}, i =1,...,N </math> from one or both players, which depends on some external content <math> x </math>, sampled from the equilibrium strategies <math>(u^*,v^*) </math>, to recover the true underlying payoff matrix P or a function form P(x) depending on the current context.<br />
<br />
=== Quantal Response Equilibria ===<br />
<br />
However, NE is poorly suited because NEs are overly strict, discontinuous with respect to P, and may not be unique. To address these issues, model the players' actions with the '''quantal response equilibria''' (QRE), where noise is added to the payoff matric. Specifically, consider the ''logit'' equilibrium for zero-sum games that obeys the fixed point:<br />
$$<br />
u^* _i = \frac {exp(-Pv)_i}{\sum_{q \in [n]} exp (-Pv)_q}, \ v^* _j= \frac {exp(P^T u)_j}{\sum_{q \in [m]} exp (P^T u)_q} .\qquad \ (1)<br />
$$<br />
For a fixed opponent strategy, the logit equilibrium corresponding to a strategy is strictly convex, and thus the regularized best response is unique.<br />
<br />
=== End-to-End Learning ===<br />
<br />
Then to integrate a zero-sum solver, [1] introduced a method to solve the QRE and to differentiate through its solution.<br />
<br />
'''QRE solver''':<br />
To find the fixed point in (1), it is equivalent to solve the regularized min-max game:<br />
$$<br />
\min_{u \in \mathbb{R}^n} \max_{v \in \mathbb{R}^m} \ u^T P v -H(v) + H(u) \\<br />
\text{subject to } 1^T u =1, \ 1^T v =1, <br />
$$<br />
where H(y) is the Gibbs entropy <math> \sum_i y_i log y_i</math>.<br />
Entropy regularization guarantees the non-negative condition and makes the equilibrium continuous with respect to P, which means players are encouraged to play more randomly, and all actions have a non-zero probability. Moreover, this problem has a unique saddle point corresponding to <math> (u^*, v^*) </math>.<br />
<br />
Using a primal-dual Newton Method to solve the QRE for two-player zero-sum games, the KKT conditions for the problem are:<br />
$$ <br />
Pv + \log(u) + 1 +\mu 1 = 0 \\<br />
P^T v -\log(v) -1 +\nu 1 = 0 \\<br />
1^T u = 1, \ 1^T v = 1, <br />
$$<br />
where <math> (\mu, \nu) </math> are Lagrange multipliers for the equality constraints on u, v respectively. Then applying Newton's method gives the the update rule:<br />
$$<br />
Q \begin{bmatrix} \Delta u \\ \Delta v \\ \Delta \mu \\ \Delta \nu \\ \end{bmatrix} = - \begin{bmatrix} P v + \log u + 1 + \mu 1 \\ P^T u - \log v - 1 + \nu 1 \\ 1^T u - 1 \\ 1^T v - 1 \\ \end{bmatrix}, \qquad (2)<br />
$$<br />
where Q is the Hessian of the Lagrangian, given by <br />
$$ <br />
Q = \begin{bmatrix} <br />
diag(\frac{1}{u}) & P & 1 & 0 \\ <br />
P^T & -diag(\frac{1}{v}) & 0 & 1\\<br />
1^T & 0 & 0 & 0 \\<br />
0 & 1^T & 0 & 0 \\<br />
\end{bmatrix}. <br />
$$<br />
<br />
'''Differentiating Through QRE Solutions''':<br />
The QRE solver provides a method to compute the necessary Jacobian-vector products. Specifically, we compute the gradient of the loss given the solution <math> (u^*,v^*) </math> to the QRE, and some loss function <math> L(u^*,v^*) </math>: <br />
<br />
1. Take differentials of the KKT conditions: <br />
<math><br />
Q \begin{bmatrix} <br />
du & dv & d\mu & d\nu \\ <br />
\end{bmatrix} ^T = \begin{bmatrix} <br />
-dPv & -dP^Tu & 0 & 0 \\ <br />
\end{bmatrix}^T. \ <br />
</math><br />
<br />
2. For small changes du, dv, <br />
<math><br />
dL = \begin{bmatrix} <br />
v^TdP^T & u^TdP & 0 & 0 \\ <br />
\end{bmatrix} Q^{-1} \begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
3. Apply this to P, and take limits as dP is small:<br />
<math><br />
\nabla_P L = y_u v^T + u y_v^T, \qquad (3)<br />
</math> where <br />
<math><br />
\begin{bmatrix} <br />
y_u & y_v & y_{\mu} & y_{\nu}\\ <br />
\end{bmatrix}=Q^{-1}\begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
Hence, the forward pass is given by using the expression in (2) to solve for the logit equilibrium given P, and the backward pass is given by using <math> \nabla_u L </math> and <math> \nabla_v L </math> to obtain <math> \nabla_P L </math> using (3). There does not always exist a unique P which generates <math> u^*, v^* </math> under the logit QRE, and we cannot expect to recover P when under-constrained.<br />
<br />
== Learning Extensive form games ==<br />
<br />
The normal form representation for games where players have many choices quickly becomes intractable. For example, consider a chess game: One the first turn, player 1 has 20 possible moves and then player 2 has 20 possible responses. If in the following number of turns each player is estimated to have ~30 possible moves and if a typical game is 40 moves per player, the total number of strategies is roughly <math>10^{120} </math> per player (this is known as the Shannon number for game-tree complexity of chess) and so the payoff matrix for a typical game of chess must therefore have <math> O(10^{240}) </math> entries.<br />
<br />
Instead, it is much more useful to represent the game graphically as an "''' Extensive form game'''" (EFG). We'll also need to consider types of games where there is '''imperfect information''' - players do not necessarily have access to the full state of the game. An example of this is one-card poker: (1) Each player draws a single card from a 13-card deck (ignore suits) (2) Player 1 decides whether to bet/hold (3) Player 2 decides whether to call/raise (4) Player 1 must either call/fold if Player 2 raised. From this description, player 1 has <math> 2^{13} </math> possible first moves (all combinations of (card, raise/hold)) and has <math> 2^{13} </math> possible second moves (whenever player 1 gets a second move) for a total of <math> 2^{26} </math> possible strategies. In addition, Player 1 never knows what cards player 2 has and vice versa. So instead of representing the game with a huge payoff matrix, we can instead represent it as a simple decision tree (for a ''single'' drawn the card of player 1):<br />
<br />
<br />
<center> [[File:1cardpoker.PNG]] </center><br />
<br />
where player 1 is represented by "1", a node that has two branches corresponding to the allowed moves of player 1. However there must also be a notion of information available to either player: While this tree might correspond to say, player 1 holding a "9", it contains no information on what card player 2 is holding (and is much simpler because of this). This leads to the definition of an '''information set''': the set of all nodes belonging to a single player for which the other player cannot distinguish which node has been reached. The information set may therefore be treated as a node itself, for which actions stemming from the node must be chosen in ignorance of what the other player did immediately before arriving at the node. In the poker example, the full game tree consists of a much more complex version of the tree shown above (containing repetitions of the given tree for every possible combination of cards dealt) and the and an example of the information set for player 1 is the set of all of the nodes owned by player 2 that immediately follow player 1's decision to hold. In other words, if player 1 holds there are 13 possible nodes describing the responses of player 2 (raise/hold for player 2 having card = ace, 1, ... King), and all 13 of these nodes are indistinguishable to player 1, and so form the information set for player 1.<br />
<br />
The following is a review of important concepts for extensive form games first formalized in [2]. Let <math> \mathcal{I}_i </math> be the set of all information sets for player i, and for each <math> t \in \mathcal{I}_i </math> let <math> \sigma_t </math> be the actions taken by player i to arrive at <math> t </math> and <math> C_t </math> be the actions that player i can take from <math> u </math>. Then the set of all possible sequences that can be taken by player i is given by<br />
<br />
$$<br />
S_i = \{\emptyset \} \cup \{ \sigma_t c | u\in \mathcal{I}_i, c \in C_t \}<br />
$$<br />
<br />
So for the one-card poker we would have <math>S_1 = \{\emptyset, \text{raise}, \text{hold}, \text{hold-call}, \text{hold-fold\} }</math>. From the possible sequences follows two important concepts:<br />
<ol><br />
<li>The EFG '''payoff matrix''' <math> P </math> of size <math>|S_1| \times |S_2| </math> (this is all possible actions available to either player), is populated with rewards from each leaf of the tree (or "zero" for each <math> (s_1, s_2) </math> that is an invalid pair), and the expected payoff for realization plans <math> (u, v) </math> is given by <math> u^T P v </math> </li><br />
<li> A '''realization plan''' <math> u \in \mathbb{R}^{|S_1|} </math> for player 1 (<math> v \in \mathbb{R}^{|S_2|} </math> for player 2 ) will describe probabilities for players to carry out each possible sequence, and each realization plan must be constrained by (i) compatibility of sequences (e.g. "raise" is not compatible with "hold-call") and (ii) information sets available to the player. These constraints are linear:<br />
<br />
$$<br />
Eu = e \\<br />
Fv = f<br />
$$<br />
<br />
where <math> e = f = (1, 0, ..., 0)^T </math> and <math> E, F</math> contain entries in <math> {-1, 0, 1} </math> describing compatibility and information sets. </li><br />
<br />
</ol> <br />
<br />
<br />
The paper's main contribution is to develop a minimax problem for extensive form games:<br />
<br />
<br />
$$<br />
\min_u \max_v u^T P v + \sum_{t\in \mathcal{I}_1} \sum_{c \in C_t} u_c \log \frac{u_c}{u_{p_t}} - \sum_{t\in \mathcal{I}_2} \sum_{c \in C_t} v_c \log \frac{v_c}{v_{p_t}}<br />
$$<br />
<br />
where <math> p_t </math> is the action immediately preceding information set <math> t </math>. Intuitively, each sum resembles a cross-entropy over the distribution of probabilities in the realization plan comparing each probability to proceed from the information set to the probability to arrive at that information set. Importantly, these entropies are strictly convex or concave (for player 1 and player 2 respectively) [3] so that the min-max problem will have a unique solution and ''the objective function is continuous and continuously differentiable'' - this means there is a way to optimize the function. As noted in Theorem 1 of [1], the solution to this problem is equivalently a solution for the QRE of the game in reduced normal form.<br />
<br />
Minimax can also be seen from an algorithmic perspective. Referring to the above figure containing a tree, it contains a sequence of states and action which alternates between two or more competing players. The above formulation of the min-max problem essentially measures how well a decision rule is from the perspective of a single player. To describe it in terms of the tree, if it is player 1's turn, then it is a mutual recursion of player 1 choosing to maximize its payoff and player 2 choosing to minimize player 1's payoff.<br />
<br />
Having decided on a cost function, the method of Lagrange multipliers may be used to construct the Lagrangian that encodes the known constraints (<math> Eu = e \,, Fv = f </math>, and <math> u, v \geq 0</math>), and then optimize the Lagrangian using Newton's method (identically to the previous section). Accounting for the constraints, the Lagrangian becomes <br />
<br />
<br />
$$<br />
\mathcal{L} = g(u, v) + \sum_i \mu_i(Eu - e)_i + \sum_i \nu_i (Fv - f)_i<br />
$$<br />
<br />
where <math>g</math> is the argument from the minimax statement above and <math>u, v \geq 0</math> become KKT conditions. The general update rule for Newton's method may be written in terms of the derivatives of <math> \mathcal{L} </math> with respect to primal variables <math>u, v </math> and dual variables <math> \mu, \nu</math>, yielding:<br />
<br />
$$<br />
\nabla_{u,v,\mu,\nu}^2 \mathcal{L} \cdot (\Delta u, \Delta v, \Delta \mu, \Delta \nu)^T= - \nabla_{u,v,\mu,\nu} \mathcal{L}<br />
$$<br />
where <math>\nabla_{u,v,\mu,\nu}^2 \mathcal{L} </math> is the Hessian of the Lagrangian and <math>\nabla_{u,v,\mu,\nu} \mathcal{L} </math> is simply a column vector of the KKT stationarity conditions. Combined with the previous section, this completes the goal of the paper: To construct a differentiable problem for learning normal form and extensive form games.<br />
<br />
== Experiments ==<br />
<br />
The authors demonstrated learning on extensive form games in the presence of ''side information'', with ''partial observations'' using three experiments. In all cases, the goal was to maximize the likelihood of realizing an observed sequence from the player, assuming they act in accordance with the QRE. The authors found that the best way to implement the module was to use a medium to large batch size, RMSProp, or Adam optimizers with a learning rate between <math>\left[0.0001,0.01\right].</math> <br />
<br />
=== Rock, Paper, Scissors ===<br />
<br />
Rock, Paper, Scissors is a 2-player zero-sum game. For this game, the best strategy to reach a Nash equilibrium and a Quantal response equilibrium is to uniformly play each hand with equal odds.<br />
The first experiment was to learn a non-symmetric variant of Rock, Paper, Scissors with ''incomplete information'' with the following payoff matrix:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix of modified Rock-Paper-Scissors''<br />
! <br />
! ''Rock''<br />
! ''Paper''<br />
! ''Scissors''<br />
|-<br />
! ''Rock''<br />
| '''''0'''''<br />
| <math>-b_1</math><br />
| <math>b_2</math><br />
|-<br />
! ''Paper''<br />
| <math>b_1</math><br />
| '''''0'''''<br />
| <math>-b_3</math><br />
|-<br />
! ''Scissors''<br />
| <math>-b_2</math><br />
| <math>b_3</math><br />
| '''''0'''''<br />
|}<br />
<br />
where each of the <math> b </math> ’s are linear function of some features <math> x \in \mathbb{R}^{2} </math> (i.e., <math> b_y = x^Tw_y </math>, <math> y \in </math> {<math>1,2,3</math>} , where <math> w_y </math> are to be learned by the algorithm). Using many trials of random rewards the technique produced the following results for optimal strategies[1]: <br />
<br />
[[File:RPS Results.png|500px ]]<br />
<br />
From the graphs above, we can tell the following: 1) both parameters learned and predicted strategies improve with a larger dataset; 2) with a reasonably sized dataset, >1000 here, convergence is stable and is fairly quick.<br />
<br />
=== One-Card Poker ===<br />
<br />
Next they investigated extensive form games using the one-Card Poker (with ''imperfect information'') introduced in the previous section. In the experimental setup, they used a deck stacked non-uniformly (meaning repeat cards were allowed). Their goal was to learn this distribution of cards from observations of many rounds of the play. Different from the distribution of cards dealt, the method built in the paper is more suited to learn the player’s perceived or believed distribution of cards. It may even be a function of contextual features such as demographics of players. Three experiments were run with <math> n=4 </math>. Each experiment comprised 5 runs of training, with same weights but different training sets. Let <math> d \in \mathbb{R}^{n}, d \ge 0, \sum_{i} d_i = 1 </math> be the weights of the cards. The probability that the players are dealt cards <math> (i,j) </math> is <math> \frac{d_i d_j}{1-d_i} </math>. This distribution is asymmetric between players. Matrix <math> P, E, F </math> for the case <math> n=4 </math> are presented in [1]. With training for 2500 epochs, the mean squared error of learned parameters (card weights, <math> u, v </math> ) are averaged over all runs of and are presented as following [1]: <br />
<br />
<br />
[[File:One-card_Poker_Results.png|500px ]]<br />
<br />
=== Security Resource Allocation Game ===<br />
<br />
<br />
From Security Resource Allocation Game, they demonstrated the ability to learn from ''imperfect observations''. The defender possesses <math> k </math> indistinguishable and indivisible defensive resources which he splits among <math> n </math> targets, { <math> T_1, ……, T_n </math>}. The attacker chooses one target. If the attack succeeds, the attacker gets <math> R_i </math> reward and defender gets <math> -R_i </math>, otherwise zero payoff for both. If there are n defenders guarding <math> T_i </math>, probability of successful attack on <math> T_i </math> is <math> \frac{1}{2^n} </math>. The expected payoff matrix when <math> n = 2, k = 3 </math>, where the attackers are the row players is:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix when <math> n = 2, k = 3 </math>''<br />
! {#<math>D_1</math>,#<math>D_2</math>}<br />
! {0, 3}<br />
! {1, 2}<br />
! {2, 1}<br />
! {3, 0}<br />
|-<br />
! <math>T_1</math><br />
| <math>-R_1</math><br />
| <math>-\frac{1}{2}R_1</math><br />
| <math>-\frac{1}{4}R_1</math><br />
| <math>-\frac{1}{8}R_1</math><br />
|-<br />
! <math>T_2</math><br />
| <math>-\frac{1}{8}R_2</math><br />
| <math>-\frac{1}{4}R_2</math><br />
| <math>-\frac{1}{2}R_2</math><br />
| <math>-R_2</math><br />
|} <br />
<br />
<br />
For a multi-stage game the attacker can launch <math> t </math> attacks, one in each stage while defender can only stick with stage 1. The attacker may change target if the attack in stage 1 is failed. Three experiments are run with <math> n = 2, k = 5 </math> for games with single attack and double attack, i.e, <math> t = 1 </math> and <math> t = 2 </math>. The results of simulated experiments are shown below [1]:<br />
<br />
[[File:Security Game Results.png|500px ]]<br />
<br />
<br />
They learned <math> R_i </math> only based on observations of the defender’s actions and could still recover the game set by only observing the defender’s actions. Same as expectation, the larger dataset size improves the learned parameters. Two outliers are 1) Security Game, the green plot for when <math> t = 2 </math>; and 2) RPS, when comparing between training sizes of 2000 and 5000.<br />
<br />
== Conclusion ==<br />
Unsurprisingly, the results of this study show that in general, the quality of learned parameters improved as the number of observations increased. The network presented in this paper demonstrated improvement over the existing methodology. <br />
<br />
This paper presents an end-to-end framework for implementing a game solver, for both extensive and normal form, as a module in a deep neural network for zero-sum games. This method, unlike many previous works in this area, does not require the parameters of the game to be known to the agent prior to the start of the game. The two-part method analytically computes both the optimal solution and the parameters of the game. Future work involves taking advantage of the KKT matrix structure to increase computation speed and extensions to the area of learning general-sum games.<br />
<br />
== Critiques ==<br />
The proposed method appears to suffer from two flaws. Firstly, the assumption that players behave in accordance with the QRE severely limits the space of player strategies and is known to exhibit pathological behavior even in one-player settings. Second, the solvers are computationally inefficient and are unable to scale. For this setting, they used Newton's method and it is found that the second-order algorithms do not scale to large games as with Nash equilibrium [4]. <br />
<br />
In the one-card poker section, it might be better to write "Next, they investigated extensive form games ... It may even be a function of contextual features such as the demographics of players. ... 5 runs of training, with the same weights but different training sets."<br />
<br />
Zero-Sum Normal Form Games usually follow Gaussian distributions with two non-empty sets of strategies of player one and player two correspondingly, and also a payoff function defined on the set of possible realizations.<br />
<br />
This method of proposing that real players will follow a laid out scheme or playing strategy is almost always a drastic oversimplification of real-world scenarios and severely limits the applications. In order to better fit a model to the real players, there almost always has to be some sort of stochastic implementation which accounts for the presence of irrational users in the system. <br />
<br />
It might be a good idea to discuss the algorithm performance based on more complicated player reactions, for example, player 2 or NPC from the game might react based on player 1's action. If player 2 would react with a "clever" choice, we might be able to shrink the choice space, which might be helpful to accelerate the training time.<br />
<br />
The theory assumes rational players, which means that roughly speaking, the players make decisions based on increasing their respective payoffs (utility values, preferences,..). However, in real-life scenarios, this might not hold true. Thus, it would be a good idea to include considerations regarding this issue in this summary. Another area that is worth exploring is the need for things like “bluffing” in games with hidden information. It would be interesting to see whether the results generated in this paper are in support of "bluffing" or not.<br />
<br />
It is good that the authors of this paper provided analysis on different types of games, but it would be great if they could also provide some future insights on this method such as whether it can be applied to more complex games such as blackjack poker / it would work or not.<br />
<br />
The authors evaluated the performance by plotting the mean squared error (MSE) of the training data. Prediction error on testing data can also be provided to better evaluate the prediction power.<br />
<br />
== References ==<br />
<br />
[1] Ling, C. K., Fang, F., & Kolter, J. Z. (2018). What game are we playing? end-to-end learning in normal and extensive form games. arXiv preprint arXiv:1805.02777.<br />
<br />
[2] B. von Stengel. Efficient computation of behavior strategies.Games and Economics Behavior,14(0050):220–246, 1996.<br />
<br />
[3] Boyd, S., Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge university press.<br />
<br />
[4] Farina, G., Kroer, C., & Sandholm, T. (2019). Online Convex Optimization for Sequential Decision Processes and Extensive-Form Games. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 1917-1925. https://doi.org/10.1609/aaai.v33i01.33011917<br />
<br />
[5] Larson, K. Game-theoretic methods for computer science Normal Form Games. Lecture presented at CS 886 in University of Waterloo, Waterloo. Retrieved from https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwiL24iwprrtAhWJFFkFHU52D0MQFjAQegQIIhAC&url=https://cs.uwaterloo.ca/~klarson/teaching/F06-886/slides/normalform.pdf&usg=AOvVaw0j9fw9oRkUGZOsY8ZCyv6i</div>X277lu