statwiki - User contributions [US]

Speech2Face: Learning the Face Behind a Voice

2020-12-07T10:08:04Z

Y87yu: /* References */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face, which uses millions of Internet/YouTube videos of people speaking to learn the correlation between a voice and the respective face. Trained through a self-supervised procedure, the model produces facial reconstructions that capture specific physical attributes of the speaker, such as age, gender, or ethnicity. Namely, the model exploits the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The work explores what types of facial information can be extracted from speech without the constraints of predefined facial characterizations. Without any prior information or accurate classifiers, the reconstructions revealed correlations between craniofacial features and voice, in addition to the correlation between dominant features (gender, age, ethnicity, etc.) and voice. The authors evaluate the model by numerically quantifying how closely the reconstructions produced by Speech2Face resemble the true face images of the respective speakers.

== Ethical Considerations ==

The authors note that, due to the potential sensitivity of facial information, they have chosen to explicitly state some ethical considerations. The first is privacy: the paper states that the method cannot recover the true identity of a speaker or produce faces of specific individuals, but rather shows average-looking faces. The paper also acknowledges that dataset biases may exist in the voice-face correlations, so the faces may not accurately represent the intended population. It recommends that any further investigation or practical use of this technology be tested to ensure that the training data represents the intended population, and that more representative data be broadly collected if it does not. Finally, it acknowledges that the model uses demographic categories such as "White" and "Asian" that are defined by a commercial face attribute classifier.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged large collections of existing video files to learn a generic audio representation by classifying whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of settings, including cross-modal retrieval, sound source localization, and sound source separation. This also paved the way for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, learning from cross-modal signals extracted from faces and voices has been posed as a binary or multi-task classification problem, with some promising results. Studies have been able to identify active speakers in a video, separate speech from multiple concurrent sources, predict lip motion from speech, and even learn the emotion of the agents based on their voices. Aytar et al. [6] proposed a student-teacher training procedure in which a well-established visual recognition model was used to transfer the knowledge obtained in the visual modality to the sound modality, using unlabeled videos.

Recently, several methods have been proposed to reconstruct visual information from audio signals, typically in settings where the reconstructed subject is known a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expressions of an agent from speech using a GAN model. A generative adversarial network (GAN) consists of a generator, which produces plausible-looking data, and a discriminator, which tries to tell whether a given sample was fabricated by the generator or is real [7]. This paper instead aims to recover the dominant and generic facial structure from speech.

== Motivation ==
It is common for people to imagine what someone looks like upon hearing their voice, before ever seeing them. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way we talk: language, accent, speed, pronunciation, etc. These properties of speech are often shared within nationalities and cultures, which can, in turn, translate to common physical features among speakers with similar voices. Concretely, given an input audio segment of a person speaking, the method reconstructs an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent one can infer how someone looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors are more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg|center]]

<div style="text-align:center;"> Figure 1. '''Speech2Face model and training pipeline''' </div>

The Speech2Face model consists of two parts: a voice encoder, which takes a spectrogram of speech as input and outputs low-dimensional face features, and a face decoder, which takes face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. At test time, the output of the voice encoder is fed to the face decoder to form an image. The variability in facial expressions, head poses, and lighting conditions of the face images creates a challenge for both the design and the training of the Speech2Face model: a model that regressed directly from speech to raw face images would have to factor out many irrelevant variations in the data and implicitly extract meaningful internal representations of faces. To avoid this problem, the model is instead trained to regress to a low-dimensional intermediate representation of the face.

'''Face Decoder'''
The face decoder is taken from previous work by Cole et al. [3] and will not be explored in great detail here. The VGG-Face model (a face recognition model pretrained on a large-scale face database [5]) is used to extract a 4096-D face feature from the penultimate layer of the network. In essence, the decoder passes this face feature through a single multilayer perceptron layer, the result of which is fed to a convolutional neural network to determine the texture of the image and to a multilayer perceptron to determine the landmark locations. The face decoder kept the VGG-Face model's dimension and weights. The decoder weights were trained separately and remained fixed during the voice encoder training.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG|center]]

<div style="text-align:center;"> Table 1: '''Voice encoder architecture''' </div>

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time and allows the model to be used for input speeches of varying lengths. Two fully connected layers at the end are used to return a 4096-dimensional facial feature output.
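
The effect of pooling only along the temporal dimension can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the pooling width of 2, and the toy feature-map sizes are assumptions chosen for illustration. It shows that the frequency axis is left untouched by the max-pooling, and that the final temporal average gives an output whose size does not depend on the length of the input clip.

```python
import numpy as np

def max_pool_time(x, width=2):
    """Max-pool a (channels, freq, time) feature map along the time axis only."""
    c, f, t = x.shape
    t_trim = (t // width) * width              # drop leftover frames, if any
    x = x[:, :, :t_trim].reshape(c, f, t_trim // width, width)
    return x.max(axis=-1)                      # the frequency axis is untouched

def avg_pool_time(x):
    """Average over the whole time axis, so the output size is length-independent."""
    return x.mean(axis=-1)

rng = np.random.default_rng(0)
short_clip = rng.standard_normal((8, 257, 300))  # hypothetical feature map of a short clip
long_clip = rng.standard_normal((8, 257, 600))   # hypothetical feature map of a longer clip

pooled_short = avg_pool_time(max_pool_time(short_clip))
pooled_long = avg_pool_time(max_pool_time(long_clip))
print(pooled_short.shape, pooled_long.shape)
```

Both clips end up with the same output shape, which is what allows fully connected layers to follow regardless of the duration of the speech segment.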

'''Training'''

The AVSpeech dataset, a large-scale audio-visual dataset, is used for training. AVSpeech comprises millions of video segments from YouTube featuring over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair color, may be predicted by the model. From each video, a 224x224 pixel image of the face was passed through the VGG-Face model to compute a facial feature vector. Combined with a spectrogram of the audio, training and test sets of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and fed to the VGG-Face model to extract <math>v_f</math>, a 4096-dimensional facial feature vector. This provides the supervision signal for the voice encoder. The voice encoder's output <math>v_s</math>, also a 4096-dimensional facial feature vector, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in practice it leads to unstable training and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which additionally penalizes the differences in the activations of the last layer of the VGG-Face model, <math>f_{VGG}: \mathbb{R}^{4096} \to \mathbb{R}^{2622}</math>, and the first layer of the face decoder, <math>f_{dec}: \mathbb{R}^{4096} \to \mathbb{R}^{1000}</math>. The final loss function is given by: $$L_{total} = \|f_{dec}(v_f) - f_{dec}(v_s)\|_1 + \lambda_1 \left\| \frac{v_f}{\|v_f\|} - \frac{v_s}{\|v_s\|} \right\|^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
In addition to the decoder-layer difference, this loss penalizes the squared Euclidean distance between the two normalized facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_i p_{(i)}(a) \log p_{(i)}(b)$$ $$p_{(i)}(a) = \frac{\exp(a_i/T)}{\sum_j \exp(a_j/T)}$$ The knowledge distillation loss is used as an alternative to the cross-entropy loss. By recommendation of Cole et al. [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that the gradient magnitudes of the three terms with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.
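
As a rough numerical illustration of the loss above, the following NumPy sketch evaluates the three terms for a single pair of feature vectors. It is only a sketch under stated assumptions: the real <math>f_{dec}</math> and <math>f_{VGG}</math> are fixed pretrained layers, which are replaced here by random linear maps (<code>W_dec</code> and <code>W_vgg</code> are hypothetical stand-ins), and the norm in the first term is taken to be an L1 norm.

```python
import numpy as np

def log_softmax_T(a, T=2.0):
    """Numerically stable log of the temperature-softened softmax p(a)."""
    z = a / T
    z = z - z.max()
    return z - np.log(np.sum(np.exp(z)))

def distill_loss(a, b, T=2.0):
    """L_distill(a, b) = -sum_i p_i(a) * log p_i(b)."""
    p_a = np.exp(log_softmax_T(a, T))
    return -np.sum(p_a * log_softmax_T(b, T))

def total_loss(v_f, v_s, f_dec, f_vgg, lam1=0.025, lam2=200.0, T=2.0):
    # Term 1: difference after the first decoder layer (L1 norm assumed).
    dec_term = np.sum(np.abs(f_dec(v_f) - f_dec(v_s)))
    # Term 2: squared Euclidean distance between the L2-normalized features.
    n_f = v_f / np.linalg.norm(v_f)
    n_s = v_s / np.linalg.norm(v_s)
    feat_term = np.sum((n_f - n_s) ** 2)
    # Term 3: knowledge distillation on the last VGG-Face layer.
    kd_term = distill_loss(f_vgg(v_f), f_vgg(v_s), T)
    return dec_term + lam1 * feat_term + lam2 * kd_term

rng = np.random.default_rng(0)
W_dec = 0.01 * rng.standard_normal((1000, 4096))  # hypothetical stand-in for f_dec
W_vgg = 0.01 * rng.standard_normal((2622, 4096))  # hypothetical stand-in for f_VGG
f_dec = lambda v: W_dec @ v
f_vgg = lambda v: W_vgg @ v

v_f = rng.standard_normal(4096)                   # "true" face feature
v_s = v_f + 0.1 * rng.standard_normal(4096)       # imperfect prediction from voice
loss = total_loss(v_f, v_s, f_dec, f_vgg)
loss_perfect = total_loss(v_f, v_f, f_dec, f_vgg)
print(loss, loss_perfect)
```

Note that even a perfect prediction (<math>v_s = v_f</math>) does not give a loss of zero: the first two terms vanish, but the distillation term reduces to the entropy of the softened distribution, which is its minimum value.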

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

<div style="text-align:center;"> Figure 2: '''Qualitative results on the AVSpeech test set''' </div>

'''Implementation Details'''

Six seconds of audio were used to compute the spectrogram by taking a short-time Fourier transform (STFT) with a Hann window of 25 ms, a hop length of 10 ms, and 512 FFT frequency bands. A CNN-based face detector from Dlib was used to crop the face regions from the frames. The VGG-Face features were computed from the resized faces and, together with the spectrograms, were used for training. In total, there were 1.7 million and 0.15 million spectrogram-feature pairs for training and testing, respectively.
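
The spectrogram computation can be sketched in a few lines of NumPy. The sampling rate of 16 kHz is an assumption (it is not stated in this summary), as is the reading of "512 FFT frequency bands" as an FFT size of 512; the function name <code>stft_magnitude</code> and the test tone are also just for illustration. Under these assumptions, the window is 400 samples long, the hop is 160 samples, and each frame is zero-padded to 512 samples before the transform.

```python
import numpy as np

def stft_magnitude(signal, sample_rate=16000, win_ms=25, hop_ms=10, n_fft=512):
    """Magnitude STFT with a Hann window; returns an array of shape (frames, bins)."""
    win_len = int(sample_rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop_len
    frames = np.stack([
        signal[i * hop_len : i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    # rfft zero-pads each frame to n_fft samples and keeps n_fft // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

sample_rate = 16000                               # assumed sampling rate
t = np.arange(6 * sample_rate) / sample_rate      # 6 seconds of audio
audio = np.sin(2 * np.pi * 440.0 * t)             # synthetic 440 Hz test tone
spec = stft_magnitude(audio, sample_rate)
print(spec.shape)
```

With these settings a 6-second clip yields 598 frames of 257 frequency bins each. Any further processing of the spectrogram before it is fed to the voice encoder is outside the scope of this sketch.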

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

<div style="text-align:center;"> Figure 3. '''Facial attribute evaluation''' </div>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++, which classifies faces by distinct attributes (such as gender, ethnicity, etc.), was used. Figure 3 gives confusion matrices for gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicity, especially with white or Asian faces. Although the model performs worse on black and Indian faces, this can be attributed to the heavily unbalanced data, where 50% of the data represented a white face and 80% represented a white or Asian face.
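
A confusion matrix of this kind can be built from two lists of attribute labels: the labels a classifier assigns to the original images and the labels it assigns to the reconstructions. The sketch below uses made-up labels and a hypothetical helper <code>confusion_matrix</code>; it does not call Face++, and the row-wise normalization is an assumption about how the matrices are presented.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, classes):
    """Row-normalized confusion matrix: rows are true classes, columns are predictions."""
    index = {c: i for i, c in enumerate(classes)}
    counts = np.zeros((len(classes), len(classes)))
    for t, p in zip(true_labels, pred_labels):
        counts[index[t], index[p]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1)      # avoid division by zero for empty rows

classes = ["female", "male"]
# Made-up attribute labels for five speakers (illustration only).
from_original = ["male", "male", "female", "female", "male"]
from_reconstruction = ["male", "male", "female", "male", "male"]

cm = confusion_matrix(from_original, from_reconstruction, classes)
print(cm)
```

Each row then sums to one, so the diagonal entries can be read as per-class agreement rates between the original images and the reconstructions.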

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

<div style="text-align:center;"> Table 2. '''Feature similarity''' </div>

Another way to examine the results is to measure the similarity of the features predicted by the Speech2Face model to the true ones. The cosine, L1, and L2 distances between the facial feature vector produced by the model and the true facial feature vector were computed and are presented in Table 2 above. A comparison of facial similarity was also done based on the length of the audio input. From the table, it is evident that the 6-second audio produced lower cosine, L1, and L2 distances than the 3-second audio, resulting in a facial feature vector that is closer to the ground truth.
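
For reference, the three distances can be computed as follows. This is a minimal sketch: the helper name <code>feature_distances</code> is hypothetical, cosine distance is taken to be one minus the cosine similarity, and no normalization of the vectors beyond that is assumed, which may differ from the exact protocol behind Table 2.

```python
import numpy as np

def feature_distances(v_s, v_f):
    """Cosine, L1, and L2 distances between a predicted and a true feature vector."""
    cos_sim = np.dot(v_s, v_f) / (np.linalg.norm(v_s) * np.linalg.norm(v_f))
    return {
        "cosine": 1.0 - cos_sim,
        "l1": np.sum(np.abs(v_s - v_f)),
        "l2": np.linalg.norm(v_s - v_f),
    }

# Two orthogonal toy vectors, standing in for 4096-D face features.
a = np.array([1.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 0.0])
d = feature_distances(a, b)
print(d)
```

Lower values of all three distances indicate that the predicted feature is closer to the true one.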

'''S2F -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

<div style="text-align:center;"> Table 3. '''S2F -> Face retrieval performance''' </div>

The performance of the model was also examined in terms of how well its output could be used to retrieve the original image. The R@K metric, also known as retrieval performance by recall at K, measures the probability that the K images closest to the model output include the correct image of the speaker's face. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.
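
The R@K metric can be sketched as below. The function name <code>recall_at_k</code>, the use of L2 distance for ranking, and the synthetic features are assumptions made for illustration; query <code>i</code> is considered correct if gallery item <code>i</code> is among its K nearest neighbours.

```python
import numpy as np

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose matching gallery item (same index) is in the top k."""
    # Pairwise L2 distances, shape (num_queries, num_gallery).
    dists = np.linalg.norm(queries[:, None, :] - gallery[None, :, :], axis=2)
    top_k = np.argsort(dists, axis=1)[:, :k]
    targets = np.arange(len(queries))[:, None]
    return (top_k == targets).any(axis=1).mean()

rng = np.random.default_rng(0)
gallery = rng.standard_normal((50, 16))                    # toy "true face" features
queries = gallery + 0.05 * rng.standard_normal((50, 16))   # toy "predicted" features

r_at_1 = recall_at_k(queries, gallery, 1)
r_at_all = recall_at_k(queries, gallery, len(gallery))
print(r_at_1, r_at_all)
```

For comparison, a retrieval system that ranks the gallery at random would achieve an R@K of roughly K divided by the gallery size, which is the random-chance baseline referred to above.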

'''Additional Observations'''

Ablation studies were carried out to test the effect of audio duration and batch normalization. It was found that the duration of input audio during the training stage had little effect on convergence speed (comparing 3 and 6-second speech segments), while in the test stage longer input speech yields improvement in reconstruction quality. With respect to batch normalization (BN), it was found that without BN reconstructed faces would converge to an average face, while the inclusion of BN led to results which contained much richer facial features.

== Conclusion ==
The paper presented a novel study of face reconstruction from audio recordings of a person speaking. The model was shown to produce plausible face reconstructions with facial features similar to those in real images of the person speaking. The problem was addressed by learning to align the feature space of speech with that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. It was then evaluated by comparing the attributes that a commercial face attribute classifier assigned to the reconstructed faces and to the true faces. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

There is evidence that the results of the model may be heavily influenced by external factors:

1. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the model's prediction of ethnicity towards white. The bias in the results shows that the model may be overfitting the training data and calls into question what the performance of the model would be when trained and tested on a balanced dataset. Figure (11) highlights this shortcoming: the same man, heard speaking in either English or Chinese, was predicted to have a "white" appearance or an "Asian" appearance respectively.

2. The model was shown to infer different face features based on language. This calls into question how heavily the model depends on the spoken language. The paper mentions that the quality of face reconstruction may be affected for uncommon languages, since English is the most popular language on YouTube (the source of the training set). Testing on a more controlled sample in which all speech recordings are in the same language may help determine the model's reliance on spoken language.

3. The evaluation of the results is also highly dependent on the Face++ classifiers. Since the authors compare age, gender, and ethnicity by running the Face++ classifiers on the original images and the reconstructions, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face and may result in a compounding effect on the misclassification rate.

4. Figure 4.b shows the AVSpeech dataset statistics. However, it doesn't show statistics on the speakers' ethnicity or the language of the video. If we train the model with a more comprehensive dataset that includes enough Asian/Indian English speakers and native-language speakers, will this increase the accuracy?

5. One concern about the source of the training data, i.e. the YouTube videos, is that resolution varies a lot since the videos are randomly selected. That may be the reason why the proposed model performs badly on certain features. For example, it is hard to tell a person's age when the resolution is poor, because the wrinkles on the face are lost.

6. The topic of this project is very interesting, but I highly doubt this model will be practical for real-world problems, because many factors affect how a person sounds in a real-world environment. Background sounds such as a phone, a clock, a TV, a car horn and so on will decrease the accuracy of the model's predictions.

7. A lot of information can be obtained from someone's voice, which can potentially be useful for detective work and crime scene investigation. In our world of increasing surveillance, public voice recording is quite common, and we could reconstruct images of potential suspects based on their voices. For this to be achieved, the model has to be thoroughly trained and tested to avoid false positives, as these could have a highly destructive outcome for a falsely convicted suspect.

8. This is a very interesting topic, and this summary has a good structure for readers. Since the model is trained on YouTube videos, one problem is that most YouTubers are adults, and many additional factors make this dataset highly unbalanced. What is more, some people may have a childlike voice, which could also affect the performance of the model. But overall, this is a meaningful topic: it might help police to locate suspects, so it might be interesting to apply it to police work.

9. In addition, it seems very unlikely that any results coming from this model would ever be held in regard even remotely close to being admissible in court to identify a person of interest until the results are improved and the model can be shown to work in real-world applications. Otherwise, there seems to be very little use for such technology and it could have negative impacts on people if they were to be depicted in an unflattering way by the model based on their voice.

10. Using voice as a factor of constructing the face is a good idea, but it seems like the data they have will have lots of noise and bias. The voice of a video might not come from the person in the video. There are so many YouTubers adjusting their voices before uploading their video and it's really hard to know whether they adjust their voice. Also, most YouTubers are adults so the model cannot have enough training samples about teenagers and kids.

11. It would be interesting to see how the performance changes with different face encoding sizes (instead of just 4096-D) and with different face models (encoders/decoders), to see if better performance can be achieved. Also, given that the dataset used was unbalanced, was the face model trained on the same dataset, or was a different dataset used (the model was pretrained)? This could affect the performance of the model as well.

12. The audio input is transformed into a spectrogram before being used for training. They use an STFT with a Hann window of 25 ms, a hop length of 10 ms, and 512 FFT frequency bands. They cite this method from a paper that focuses on speech separation, not speech classification. So, it would be interesting to see if there is a better way to do the STFT, possibly with different hyperparameters (e.g. different windowing, a different number of bands), or if another type of transform (e.g. a wavelet transform) would give better results.

13. An easy way to get somewhat balanced data is to duplicate samples from the under-represented groups.

14. This problem is interesting but is hard to generalize. This algorithm didn't account for other genders and mixed-race. In addition, the face recognition software Face++ introduces bias which can carry forward to Speech2Face algorithm. Face recognition algorithms are known to have higher error rates classifying darker-skinned individuals. Thus, it'll be tough to apply it to real-life scenarios like identifying suspects.

15. This experiment raises a lot of ethical complications when it comes to possible applications in the real world. Even if this model were highly accurate, the implications of being able to discern a person's racial ethnicity, skin tone, etc. based solely on their voice could play into inherent biases in the application user, and this may end up being an issue that needs to be combatted in future research in this area. Another possible issue is that many people change their intonation or vocal features based on the context (for example, I'll likely have a different voice pattern in terms of projection, intonation, etc. in a job interview than when casually chatting or mumbling with a friend while playing video games).

16. Overall a very interesting topic. I want to talk about the technical challenges raised by using the AVSpeech dataset for training. The paper acknowledges that AVSpeech is unbalanced and that 80% of the data are white and Asian faces. It also says in the results section that "Our model does not perform on other races due to the imbalance in data". There does not seem to be any effort made to balance the data. I think that there are definitely some data processing techniques (filtering, data augmentation, etc.) that could be used to address the class imbalance problem. Not seeing any of these in the paper is a bit disappointing. Another issue I have noticed is that, due to ethical considerations, the model aims to predict an average-looking face for a certain gender/racial group from voice input. If we cannot reveal the identity of a person, why not predict the gender and race directly? Giving an average-looking face does not seem to be the most helpful output.

17. This is a very interesting research paper to study, and its main objective is also interesting. The research raises open questions and could be extended to other applications, such as more advanced ways of predicting a person's face from their voice. The only risk is how the data is obtained from YouTube, where the data is not consistent.

18. The paper uses millions of natural videos of people speaking to find the correlation between face and voice. Since face and voice are commonly used to identify a person, there are many possible research opportunities and applications for improving voice and face unlock.

19. It would be better to have a future work section that discusses the current shortcomings and explores possible improvements and applications.

20. While the idea behind Speech2Face is interesting, ethnic profiling is a huge concern and it can further lead to racial discrimination, racism etc. Developers must put more care and thought into applying Speech2Face in tech before deploying the products.

21. It would be helpful if the authors could explore the different applications of this project in real life. Speech2Face could be helpful during criminal investigations and, more generally, in scenarios where someone's picture is missing and only their voice is available. It would also be helpful if the authors could state the importance of and need for this kind of project in society.

22. The authors mention that they use the AVSpeech dataset for both training and testing but do not talk about how they split the data. It is possible that the same speakers were used in the training and testing data and so the model is able to recreate a face simply by matching the observed face to the observed audio. This would explain the striking example images shown in the paper.

23. Another interesting application of this research is automated speech or facial animation at scale or in multiple languages. The cutting-edge automated facial animation solution provided by JALI Research Inc is applied in Cyberpunk 2077.

24. It would be interesting to know whether the model can predict a similar face when one person speaks different languages. A person who speaks multiple languages can have different tones and accents depending on the language they are speaking.

25. The results are actually amazing for the introduction of Speech2Face. As others have mentioned, the researchers might have used a biased dataset of YouTube videos favoring certain ethnicities and their accents and dialects. Thus, it would be nice to also see the data distribution. Additionally it would be nice to see how their model reacts to people who are able to speak multiple languages and see how well Speech2Face generalizes different language pronunciations of one person.

26. The paper introduces Speech2Face, and this is definitely one of the major areas of research for the future. In the paper, the confusion matrix indicates that the model tends to misclassify the age of the speaker. Specifically, the model tends to misclassify speakers between 40 and 70. It would be interesting to see if the model could improve on this bottleneck by training on more speech from the 40-70 age group.

27. An interesting topic, and as others have mentioned, one with many ethical considerations and implications. Particularly in regions where call recording is permitted, there is dangerous potential for the technology to be misused to identify and target individuals. It would also be interesting to get a more in-depth exploration of how the language spoken and accents introduce bias. For example, if a person speaks with a strong British accent, are they classified as white? Spanish speakers, in particular, vary greatly with respect to their skin colour and features; how well does the algorithm work on these individuals? A last nit-pick is the labelling used (i.e. Asian, White, Indian, Black), as this is not accurate since Indians, and moreover South Asians, fall under Asian as well.

28. This topic is quite interesting and could make a great contribution to fighting crime. For that purpose, however, accuracy is essential. There is still much room for improvement, since inferring a person's face from his/her voice is hard: many factors are involved, such as oral structure, language environment, and even personality. These unpredictable factors could introduce considerable bias.

29. This is an interesting topic and could be of great use in finding criminals or other people whose voices have been recorded. However, the voice recording might be noisy, and some recordings might include the voices of multiple people. The authors could consider ways to eliminate those factors that might affect the accuracy of the face generation.

30. Most of the content described in the paper is very useful. However, YouTube might not be a good enough data source since there are fewer labels to classify. Perhaps, after the model has been trained, transfer learning could be done on Facebook's videos in order to address the imbalance problem.

31. This topic is really incredibly interesting and the writers should commend themselves on a job well done. However, YouTube is not only an ethnically skewed dataset, but also has a non-negligible number of creators who use voice modifiers, auto-tune, or other tools to change the pitch of their voices, which may lead to significantly more errors in practical applications. A better dataset might be Skype video calls or a classroom study. Also, judging from the way the model makes its predictions, it seems very prone to overfitting the dataset and may not generalize well, since pitch and sound are both incredibly variable across humans.

32. One thing to notice is that the training data is downloaded from YouTube, which is a good site for retrieving a large amount of data. However, this leaves open the possibility that the voices retrieved do not match the people shown in the video who are claimed to be speaking. If that is the case, those records become dirty data and need to be cleaned before training the model. Otherwise, there will be some large misclassifications because some of the training data does not make sense. One way I can think of to mitigate this problem is to train multiple models on different subsets of the original dataset and combine the results of all the models by taking a weighted average.

33. Predicting appearance with sound is a very imaginative research direction. But the author did not explain how to exclude environmental factors in data preprocessing, such as light intensity, facial dress, facial wounds, etc. In the training data set, different sound and image resolutions also affect the effectiveness of the model. The author needs more robustness tests to exclude these factors.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision (ICCV), 2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giro-i-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.

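[6] Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (NIPS), 2016.

[7] “Overview of GAN Structure | Generative Adversarial Networks,” ''Google Developers'', 2019. [Online]. Available: https://developers.google.com/machine-learning/gan/gan_structure. [Accessed: 02-Dec-2020].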

Speech2Face: Learning the Face Behind a Voice

2020-12-07T10:07:55Z

Y87yu: /* References */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face which utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model produces facial reconstruction images that capture specific physical attributes learning the correlations between faces and voices, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. This model explores what types of facial information could be extracted from speech without the constraints of predefined facial characterizations. Without any prior information or accurate classifiers, the reconstructions revealed correlations between craniofacial features and voice in addition to the correlation between dominant features (gender, age, ethnicity, etc.) and voice. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Ethical Considerations ==

The authors note that, due to the potential sensitivity of facial information, they have chosen to explicitly state some ethical considerations. The first of these is privacy. The paper states that the method cannot recover the true identity of the face or produce faces of specific individuals, but rather will show average-looking faces. The paper also acknowledges potential dataset biases in the voice-face correlations, so the faces may not accurately represent the intended population. The paper recommends that any further investigation or practical use of this technology be tested to ensure that the training data represents the intended population, and that if it does not, more representative data should be broadly collected. Finally, it acknowledges that the model uses demographic categories such as "White" and "Asian" that are defined by a commercial face attribute classifier.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of settings, including cross-modal retrieval, sound source localization, and sound source separation. This also paved the way for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, learning cross-modal signals from faces and voices has been posed as a binary or multi-task classification task, and there have been some promising results. Studies have been able to identify active speakers in a video, separate speech from multiple concurrent sources, predict lip motion from speech, and even learn the emotion of the agents based on their voices. Aytar et al. [6] proposed a student-teacher training procedure in which a well-established visual recognition model was used to transfer the knowledge obtained in the visual modality to the sound modality, using unlabeled videos.

Recently, several methods have been proposed that use audio signals to reconstruct visual information, where the reconstructed subject is known a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. A generative adversarial network (GAN) is a model that uses a generator to produce plausible synthetic data and a discriminator that learns to distinguish the generator's fabricated data from real data [7]. This paper instead aims to recover the dominant and generic facial structure from speech.

== Motivation ==
It seems to be a common human trait to imagine what people look like when we hear their voices before we have seen them. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way we talk: language, accent, speed, pronunciation, etc. These properties of speech are often shared within nationalities and cultures, which can, in turn, translate to common physical features among speakers with similar voices. Namely, from an input audio segment of a person speaking, the method reconstructs an image of the person's face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent one can infer how someone looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors are more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg|center]]

<div style="text-align:center;"> Figure 1. '''Speech2Face model and training pipeline''' </div>

The Speech2Face model consists of two parts: a voice encoder, which takes a spectrogram of speech as input and outputs low-dimensional face features, and a face decoder, which takes face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The outputs of the voice encoder are passed to the face decoder to form an image. The variability in facial expressions, head positions, and lighting conditions of the face images creates a challenge for both the design and training of the Speech2Face model. A model trained directly on images would need to factor out many irrelevant variations in the data and implicitly extract important internal representations of faces. To avoid this problem, the model is trained to first regress to a low-dimensional intermediate representation of the face.

'''Face Decoder'''

The face decoder was taken from previous work by Cole et al. [3] and will not be explored in great detail here. The VGG-Face model, a face recognition model pretrained on a large-scale face database [5], is used to extract a 4096-D face feature from the penultimate layer of the network. In essence, the face recognition model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image and through a multilayer perceptron to determine the landmark locations. The face decoder kept the VGG-Face model's dimension and weights. The decoder's weights were also trained separately and remained fixed during the voice encoder training.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG|center]]

<div style="text-align:center;"> Table 1: '''Voice encoder architecture''' </div>

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time and allows the model to be used for input speeches of varying lengths. Two fully connected layers at the end are used to return a 4096-dimensional facial feature output.
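The effect of pooling only along the temporal dimension can be illustrated with a small sketch. This is not the authors' implementation: the helper names <code>max_pool_time</code> and <code>avg_pool_time</code>, the toy feature maps, and the pooling size are all assumptions chosen for illustration. The point is that max-pooling over time keeps every frequency row, and the final average over time yields a vector whose length does not depend on the duration of the input.

```python
def max_pool_time(spec, size):
    """Max-pool a [frequency][time] feature map along the time axis only."""
    return [
        [max(row[t:t + size]) for t in range(0, len(row) - size + 1, size)]
        for row in spec
    ]


def avg_pool_time(spec):
    """Average each frequency row over all time steps (fixed-length output)."""
    return [sum(row) / len(row) for row in spec]


# Toy inputs (hypothetical values): 3 frequency bins, different durations.
short_clip = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0], [5.0, 1.0, 5.0, 1.0]]
long_clip = [row + row for row in short_clip]  # twice as many time steps

pooled_short = max_pool_time(short_clip, size=2)  # 3 rows, 2 time steps
pooled_long = max_pool_time(long_clip, size=2)    # 3 rows, 4 time steps

# The frequency axis is untouched; only the time axis shrinks.
vec_short = avg_pool_time(pooled_short)
vec_long = avg_pool_time(pooled_long)
print(len(vec_short), len(vec_long))
```

Both clips end up as vectors with one entry per frequency row, which is why fully connected layers can follow regardless of the length of the input speech.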

'''Training'''

The AVSpeech dataset, a large-scale audio-visual dataset, is used for training. AVSpeech comprises millions of video segments from YouTube featuring over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair color, may be predicted by the model. From each video, a 224x224 pixel image of the face was passed through the VGG-Face model to compute a facial feature vector. Combined with a spectrogram of the audio, a training set and a test set of 1.7 and 0.15 million entries, respectively, were constructed.

The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and then fed to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096-dimensional facial feature vector computed from that single frame. This provides the supervision signal for the voice encoder. The feature <math>v_s</math>, the 4096-dimensional facial feature vector produced by the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>\|v_f - v_s\|_1</math>, may seem like a suitable loss function, but in practice it leads to unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>\|v_f - v_s\|_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the VGG-Face model <math>f_{VGG}</math>: <math> \mathbb{R}^{4096} \to \mathbb{R}^{2622}</math> and the first layer of the face decoder <math>f_{dec}</math>: <math> \mathbb{R}^{4096} \to \mathbb{R}^{1000}</math>. The final loss function is given by: $$L_{total} = \|f_{dec}(v_f) - f_{dec}(v_s)\| + \lambda_1\left\|\frac{v_f}{\|v_f\|} - \frac{v_s}{\|v_s\|}\right\|^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
In addition to the decoder-layer difference, this loss penalizes both the normalized Euclidean distance between the two facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_i p_{(i)}(a)\log p_{(i)}(b)$$ $$p_{(i)}(a) = \frac{\exp(a_i/T)}{\sum_j \exp(a_j/T)}$$ Knowledge distillation is used as an alternative to cross-entropy. Following the recommendation of Cole et al. [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that the magnitudes of the gradients of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.
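The terms of the loss can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: <code>f_dec</code> and <code>f_vgg</code> are passed in as arbitrary callables standing in for the real network layers, the first term is assumed to use an L1 norm (the formula above does not specify the norm), and all function names are hypothetical.

```python
import math


def softmax_t(logits, T=2.0):
    """Temperature-scaled softmax p_(i)(a) = exp(a_i/T) / sum_j exp(a_j/T)."""
    shift = max(x / T for x in logits)  # subtract the max for numerical stability
    exps = [math.exp(x / T - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(a, b, T=2.0):
    """L_distill(a, b) = -sum_i p_(i)(a) * log p_(i)(b)."""
    p_a, p_b = softmax_t(a, T), softmax_t(b, T)
    return -sum(pa * math.log(pb) for pa, pb in zip(p_a, p_b))


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def normalized_sq_dist(v_f, v_s):
    """Squared Euclidean distance between the L2-normalized vectors."""
    return sum((a - b) ** 2 for a, b in zip(l2_normalize(v_f), l2_normalize(v_s)))


def total_loss(v_f, v_s, f_dec, f_vgg, lam1=0.025, lam2=200.0, T=2.0):
    # Assumption: L1 norm for the decoder-layer term.
    dec_term = sum(abs(a - b) for a, b in zip(f_dec(v_f), f_dec(v_s)))
    return (dec_term
            + lam1 * normalized_sq_dist(v_f, v_s)
            + lam2 * distill_loss(f_vgg(v_f), f_vgg(v_s), T))


# Toy usage with identity functions as stand-ins for the network layers.
identity = lambda v: v
print(total_loss([1.0, 2.0, 3.0], [1.1, 1.9, 3.2], identity, identity))
```

Note that <code>distill_loss(a, a)</code> equals the entropy of the softened distribution rather than zero, so the distillation term has a nonzero floor even for a perfect prediction.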

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

<div style="text-align:center;"> Figure 2: '''Qualitative results on the AVSpeech test set''' </div>

'''Implementation Details'''

Six seconds of audio were used to compute the spectrogram by taking a short-time Fourier transform (STFT) with a Hann window of 25 ms, a hop length of 10 ms, and 512 FFT frequency bands. A CNN-based face detector from Dlib was used to crop the face regions from the frames. The VGG-Face features were computed from the resized faces and, together with the spectrograms, were used for training. There were a total of 1.7 million and 0.15 million spectra-feature pairs in the training and test sets, respectively.
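To give a feel for the resulting spectrogram size, the sketch below converts the window and hop lengths to samples and counts the frames. Several things here are assumptions rather than details taken from the summary: a 16 kHz sample rate, no padding at the clip boundaries, and reading "512 FFT frequency bands" as an FFT size of 512 (which gives 257 one-sided frequency bins). The function names are hypothetical.

```python
import math


def hann_window(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def stft_shape(duration_s, sample_rate, win_ms=25, hop_ms=10, n_fft=512):
    """Return (window samples, hop samples, frame count, one-sided bin count)."""
    win = int(round(sample_rate * win_ms / 1000))
    hop = int(round(sample_rate * hop_ms / 1000))
    n_samples = int(round(duration_s * sample_rate))
    n_frames = 1 + (n_samples - win) // hop  # assumes no padding
    n_bins = n_fft // 2 + 1                  # assumes n_fft is the FFT size
    return win, hop, n_frames, n_bins


# Assumed sample rate of 16 kHz for a 6-second clip.
win, hop, n_frames, n_bins = stft_shape(6, 16000)
window = hann_window(win)
print(win, hop, n_frames, n_bins)
```

Under these assumptions, the window spans 400 samples and the hop 160 samples, so halving the clip to 3 seconds roughly halves the number of frames while leaving the frequency axis unchanged.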

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

<div style="text-align:center;"> Figure 3. '''Facial attribute evaluation''' </div>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++, which classifies faces by distinct attributes (such as gender, ethnicity, etc.), was used. Figure 3 gives confusion matrices for gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicity, especially with White or Asian faces. Although the model performs worse on Black and Indian faces, this can be attributed to the vastly unbalanced data, where 50% of the data represented a White face and 80% represented a White or Asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

<div style="text-align:center;"> Table 2. '''Feature similarity''' </div>

The results were also examined in terms of the similarity of the features predicted by the Speech2Face model. The cosine, L1, and L2 distances between the facial feature vector produced by the model and the true facial feature vector computed from the face image were calculated and are presented above in Table 2. A comparison of facial similarity was also done based on the length of the audio input. From the table, it is evident that the 6-second audio produced lower cosine, L1, and L2 distances, resulting in a facial feature vector that is closer to the ground truth.
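For reference, the three distances can be computed as in the sketch below. The exact definitions and any normalization used for Table 2 are not given in this summary, so the formulas here are the standard ones and should be read as an assumption; in particular, cosine distance is taken to be one minus the cosine similarity.

```python
import math


def l1_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def l2_distance(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity (assumed definition)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy 3-D vectors standing in for the 4096-D features v_s and v_f.
v_s = [0.9, 0.1, 0.4]
v_f = [1.0, 0.0, 0.5]
print(l1_distance(v_s, v_f), l2_distance(v_s, v_f), cosine_distance(v_s, v_f))
```

Lower values of all three indicate that the predicted feature vector is closer to the true one.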

'''S2F -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

<div style="text-align:center;"> Table 3. '''S2F -> Face retrieval performance''' </div>

The performance of the model was also examined in terms of how well its output could be used to retrieve the original image. The R@K metric, also known as retrieval performance by recall at K, measures the probability that the K closest images to the model output include the correct image of the speaker's face. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.
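A minimal sketch of how R@K could be computed is given below. It assumes that each query (a feature vector predicted from speech) has exactly one correct entry in a gallery of face feature vectors and that closeness is measured by squared Euclidean distance; the distance actually used for Table 3 is not stated in this summary, and the names <code>recall_at_k</code> and <code>sq_dist</code> are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance (assumed ranking criterion)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def recall_at_k(queries, gallery, true_indices, k):
    """Fraction of queries whose true gallery item is among the k nearest."""
    hits = 0
    for query, true_idx in zip(queries, true_indices):
        ranked = sorted(range(len(gallery)),
                        key=lambda i: sq_dist(query, gallery[i]))
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy 2-D example: four gallery faces and three speech-derived queries.
gallery = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
queries = [[0.1, 0.0], [0.6, 0.1], [4.0, 4.0]]
true_indices = [0, 0, 3]

r_at_1 = recall_at_k(queries, gallery, true_indices, k=1)
r_at_2 = recall_at_k(queries, gallery, true_indices, k=2)
print(r_at_1, r_at_2)
```

In this toy example the second query is closest to the wrong gallery item, so it counts as a miss at K = 1 but as a hit at K = 2, which illustrates why R@K can only increase as K grows.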

'''Additional Observations'''

Ablation studies were carried out to test the effect of audio duration and batch normalization. It was found that the duration of input audio during the training stage had little effect on convergence speed (comparing 3 and 6-second speech segments), while in the test stage longer input speech yields improvement in reconstruction quality. With respect to batch normalization (BN), it was found that without BN reconstructed faces would converge to an average face, while the inclusion of BN led to results which contained much richer facial features.

== Conclusion ==
The paper presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the attributes of the reconstructed faces with those of the true faces using a commercial face attribute classification service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

There is evidence that the results of the model may be heavily influenced by external factors:

1. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were White. We also saw a large bias in the model's prediction of ethnicity towards White. The bias in the results shows that the model may be overfitting the training data and calls into question what the performance of the model would be when trained and tested on a balanced dataset. Figure (11) highlights this shortcoming: the same man heard speaking in either English or Chinese was predicted to have a "White" appearance or an "Asian" appearance, respectively.

2. The model was shown to infer different face features based on language. This calls into question how heavily the model depends on the spoken language. The paper mentioned that the quality of face reconstruction may be affected by uncommon languages, since English is the most popular language on YouTube (the training set). Testing on a more controlled sample in which all speech recordings are in the same language may help determine the model's reliance on spoken language.

3. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender, and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face and may result in a compounding effect on the misclassification rate.

4. Figure 4.b shows the AVSpeech dataset statistics. However, it doesn't show statistics about the speakers' ethnicity or the language of the video. If we train the model with a more comprehensive dataset that includes enough Asian/Indian English speakers and native-language speakers, will this increase the accuracy?

5. One concern about the source of the training data, i.e. the YouTube videos, is that resolution varies a lot since the videos are randomly selected. That may be the reason why the proposed model performs badly on certain features. For example, it is hard to tell the age when the resolution is bad because the wrinkles on the face are lost.

6. The topic of this project is very interesting, but I highly doubt this model will be practical in real-world problems, because many factors affect a person's voice in a real-world environment. Background sounds such as a phone clock, TV, car horn and so on will decrease the accuracy of the model's predictions.

7. A lot of information can be obtained from someone's voice, and this can potentially be useful for detective work and crime scene investigation. In our world of increasing surveillance, public voice recording is quite common, and we could reconstruct images of potential suspects based on their voices. In order for this to be achieved, the model has to be thoroughly trained and tested to avoid false positives, as it could have a highly destructive outcome for a falsely convicted suspect.

8. This is a very interesting topic, and this summary has a good structure for readers. Since this model uses YouTube videos for training, I think one problem is that most YouTubers are adults, and many additional factors make this dataset highly unbalanced. What is more, some people may have a baby voice, which could also affect the performance of the model. But overall, this is a meaningful topic that might help police locate suspects, so it might be interesting to apply this to police work.

9. In addition, it seems very unlikely that any results coming from this model would ever be held in regard even remotely close to being admissible in court to identify a person of interest until the results are improved and the model can be shown to work in real-world applications. Otherwise, there seems to be very little use for such technology and it could have negative impacts on people if they were to be depicted in an unflattering way by the model based on their voice.

10. Using voice as a factor of constructing the face is a good idea, but it seems like the data they have will have lots of noise and bias. The voice of a video might not come from the person in the video. There are so many YouTubers adjusting their voices before uploading their video and it's really hard to know whether they adjust their voice. Also, most YouTubers are adults so the model cannot have enough training samples about teenagers and kids.

11. It would be interesting to see how the performance changes with different face encoding sizes (instead of just 4096-D) and also different face models (encoders/decoders) to see if better performance can be achieved. Also, given that the dataset used was unbalanced, was the face model trained on the same dataset, or was a different dataset used (the model was pretrained)? This could affect the performance of the model as well.

12. The audio input is transformed into a spectrogram before being used for training. They use STFT with a Hann window of 25 ms, a hop length of 10 ms, and 512 FFT frequency bands. They cite this method from a paper that focuses on speech separation, not speech classification. So, it would be interesting to see if there is a better way to do STFT, possibly with different hyperparameters (e.g. different windowing, different number of bands), or if another type of transform (e.g. wavelet transform) would have better results.

13. An easy way to get somewhat balanced data is to duplicate samples from the under-represented groups.

14. This problem is interesting but is hard to generalize. This algorithm didn't account for other genders or mixed-race individuals. In addition, the face recognition software Face++ introduces bias which can carry forward to the Speech2Face algorithm. Face recognition algorithms are known to have higher error rates when classifying darker-skinned individuals. Thus, it'll be tough to apply it to real-life scenarios like identifying suspects.

15. This experiment raises a lot of ethical complications when it comes to possible applications in the real world. Even if this model was highly accurate, the implications of being able to discern a person's racial ethnicity, skin tone, etc. based solely on their voice could play into inherent biases in the application user, and this may end up being an issue that needs to be combatted in future research in this area. Another possible issue is that many people will change their intonation or vocal features based on the context (I'll likely have a different voice pattern in a job interview in terms of projection, intonation, etc. than if I was casually chatting/mumbling with a friend while playing video games, for example).

16. Overall a very interesting topic. I want to talk about the technical challenges raised by using the AVSpeech dataset for training. The paper acknowledges that AVSpeech is unbalanced, and 80% of the data are White and Asian faces. It also says in the results section that "Our model does not perform on other races due to the imbalance in data". There does not seem to be any effort made in balancing the data. I think that there are definitely some data processing techniques that can be used (filtering, data augmentation, etc.) to address the class imbalance problem. Not seeing any of these in the paper is a bit disappointing. Another issue I have noticed is that the model aims to predict an average-looking face of a certain gender/racial group from voice input, due to ethical considerations. If we cannot reveal the identity of a person, why don't we predict the gender and race directly? Giving an average-looking face does not seem to be the most helpful.

17. This is a very interesting research paper with an interesting main objective. The research leads to open questions that could carry over to other applications, such as predicting a person's face from their voice, and could be used in more advanced ways. The only risk is how the data is obtained from YouTube, where data is not consistent.

18. The essay uses millions of natural videos of people speaking to find the correlation between face and voice. Since face and voice are commonly used as the identity of a person, there are many possible research opportunities and applications about improving voice and face unlock.

19. It would be better to have a future work section to discuss the current shortcomings and explore possible improvements and applications in the future.

20. While the idea behind Speech2Face is interesting, ethnic profiling is a huge concern and it can further lead to racial discrimination, racism etc. Developers must put more care and thought into applying Speech2Face in tech before deploying the products.

21. It would be helpful if the authors could explore the different applications of this project in real life. Speech2Face can be helpful during criminal investigations and in scenarios when someone's picture is missing and only their voice is available. It would also be helpful if the authors could state the importance of and need for this kind of project in society.

22. The authors mention that they use the AVSpeech dataset for both training and testing but do not talk about how they split the data. It is possible that the same speakers were used in the training and testing data and so the model is able to recreate a face simply by matching the observed face to the observed audio. This would explain the striking example images shown in the paper.

23. Another interesting application of this research is automated speech or facial animation at scale or in multiple languages. The cutting-edge automated facial animation solution provided by JALI Research Inc is applied in Cyberpunk 2077.

24. It would be interesting to know whether the model can predict a similar face when one person is speaking different languages. A person who speaks multiple languages can have different tones and accents depending on the language that they speak.

25. The results are actually amazing for the introduction of Speech2Face. As others have mentioned, the researchers might have used a biased dataset of YouTube videos favoring certain ethnicities and their accents and dialects. Thus, it would be nice to also see the data distribution. Additionally it would be nice to see how their model reacts to people who are able to speak multiple languages and see how well Speech2Face generalizes different language pronunciations of one person.

26. The paper introduces Speech2Face, and it is definitely one of the major areas of research for the future. In the paper, the confusion matrix indicates that the model tends to misclassify the age of the speaker. Specifically, the model tends to misclassify between ages 40 and 70. It would be interesting to see if the model could improve on this bottleneck by training on more speech from the 40-70 age group.

27. An interesting topic that, as others have mentioned, has many ethical considerations and implications. Particularly in regions where call recording is permitted, there is dangerous potential for the technology to be misused to identify and target individuals. It would also be interesting to get a more in-depth exploration of how the language spoken and accents introduce bias. For example, if a person speaks with a strong British accent, are they classified as White? Spanish speakers in particular vary greatly with respect to their skin colour and features, so how well does the algorithm work on these individuals? A last nit-pick is the labelling used (i.e. Asian, White, Indian, Black), as this is not accurate since Indians, and moreover South Asians, fall under Asian as well.

28. This topic is quite interesting and could make a great contribution to fighting crime. For that, however, accuracy is essential. There is still much room for improvement, since inferring a person's face from his/her voice is pretty hard: there are many factors involved, such as oral structure, the language environment, and even personality. These unpredictable factors could result in great bias.

29. This is an interesting topic and could have great use in terms of finding criminals or other people when their voice has been recorded. However, the voice recording might be noisy and some might include voices of multiple people. The authors could consider ways to eliminate those factors that might affect the accuracy of the face generation.

30. Most contents described in the paper are very useful. However, YouTube might not be a good enough data source since there are fewer labels to classify. Perhaps, after generating the model, the transfer learning could be done based on Facebook's videos in order to solve the imbalanced problem.

31. This topic is really incredibly interesting and the writers should commend themselves on a job well done. However, YouTube is not only an ethnically skewed dataset, but also has a non-negligible number of creators who use voice modifiers, auto-tune, or other tools to change the pitch of their voices, which may lead to significantly more errors in practical applications. A better dataset could be Skype video calls or a classroom study. Also, judging from the way the model makes its predictions, it seems very prone to overfitting on the dataset and will not generalize well, since pitch and sound are both incredibly variable across humans.

32. One thing to notice is that the training data used to train the model is downloaded from YouTube, which may be a good site to retrieve a large amount of data. However, it is possible that the voices retrieved do not match the people who, according to the video, made those sounds. If that is the case, those records become dirty data and need to be cleaned before training the model. Otherwise, there will be some huge misclassifications because some of the training data does not make sense. One way I can think of to mitigate this problem is to train multiple models on different subsets of the original dataset and combine the results of all the models by taking a weighted average.

33. Predicting appearance with sound is a very imaginative research direction. But the author did not explain how to exclude environmental factors in data preprocessing, such as light intensity, facial dress, facial wounds, etc. In the training data set, different sound and image resolutions also affect the effectiveness of the model. The author needs more robustness tests to exclude these factors.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision (ICCV), 2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giro-i-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.

[7] “Overview of GAN Structure | Generative Adversarial Networks,” ''Google Developers'', 2019. [Online]. Available: https://developers.google.com/machine-learning/gan/gan_structure. [Accessed: 02-Dec-2020].

Speech2Face: Learning the Face Behind a Voice

2020-12-07T10:07:08Z

Y87yu: /* Discussion and Critiques */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face which utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model produces facial reconstruction images that capture specific physical attributes learning the correlations between faces and voices, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. This model explores what types of facial information could be extracted from speech without the constraints of predefined facial characterizations. Without any prior information or accurate classifiers, the reconstructions revealed correlations between craniofacial features and voice in addition to the correlation between dominant features (gender, age, ethnicity, etc.) and voice. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Ethical Considerations ==

The authors note that due to the potential sensitivity of facial information, they have chosen to explicitly state some ethical considerations. The first of which is privacy. The paper states that the method cannot recover the true identity of the face or produce faces of specific individuals, but rather will show average-looking faces. The paper also addresses that there are potential dataset biases that exist for the voice-face correlations, thus the faces may not accurately represent the intended population. The paper recommends that any further investigation or practical use of this technology will be tested to represent the intended population and also if the data does not reflect this, more representative data should be broadly collected. Finally, it acknowledges that the model uses demographic categories such as "White" and "Asian" that are defined by a commercial face attribute classifier.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the natural co-occurrence of visual and audio signals in existing videos to learn generic audio-visual representations by classifying whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of settings, including cross-modal retrieval, sound source localization, and sound source separation. This also paved the way for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, matching cross-modal signals extracted from faces and voices has been posed as a binary or multi-task classification problem, with some promising results. Studies have been able to identify active speakers in a video, separate speech from multiple concurrent sources, predict lip motion from speech, and even learn the emotion of the agents based on their voices. Aytar et al. [6] proposed a student-teacher training procedure in which a well-established visual recognition model was used to transfer the knowledge obtained in the visual modality to the sound modality, using unlabeled videos.

Recently, several methods have been proposed that use audio signals to reconstruct visual information, where the reconstructed subject is known a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expressions of an agent from speech using a GAN model. A generative adversarial network (GAN) is a model that uses a generator to produce plausible data and a discriminator that identifies whether the data it sees is real or was fabricated by the generator [7]. This paper instead aims to recover the dominant and generic facial structure from speech.

== Motivation ==
It seems to be a common trait among humans to imagine what people look like when we hear their voices before we have seen them. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way we talk: language, accent, speed, pronunciation, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. The goal was to study to what extent one can infer how someone looks from the way they talk. Namely, from an input audio segment of a person speaking, the method reconstructs an image of the person’s face in a canonical form (frontal-facing, neutral expression). Rather than predicting a recognizable image of the exact face, the authors are more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg|center]]

<div style="text-align:center;"> Figure 1. '''Speech2Face model and training pipeline''' </div>

The Speech2Face model consists of two parts: a voice encoder, which takes a spectrogram of speech as input and outputs low-dimensional face features, and a face decoder, which takes face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The output of the voice encoder is fed to the face decoder to form an image. The variability in facial expressions, head positions, and lighting conditions of the face images creates a challenge for both the design and the training of the Speech2Face model: a model that predicted face images directly would need to factor out many irrelevant variations in the data and implicitly extract important internal representations of faces. To avoid this problem, the model is trained to first regress to a low-dimensional intermediate representation of the face.

'''Face Decoder'''
The face decoder itself was taken from previous work by Cole et al. [3] and will not be explored in great detail here. The VGG-Face model (a face recognition model that is pretrained on a large-scale face database [5]) is used to extract a 4096-D face feature from the penultimate layer of the network. In essence, the face recognition model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image and through a multilayer perceptron to determine the landmark locations. The face decoder kept the VGG-Face model's dimension and weights. The decoder weights were trained separately and remained fixed during the voice encoder training.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG|center]]

<div style="text-align:center;"> Table 1: '''Voice encoder architecture''' </div>

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between blocks of convolution, ReLU, and batch normalization layers, and max-pooling layers. In each max-pooling layer, pooling is done only along the temporal dimension of the data. This ensures that frequency information, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, average pooling is applied along the temporal dimension. This lets the model aggregate information over time and makes it usable for input speech of varying length. Two fully connected layers at the end return a 4096-dimensional facial feature output.
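
The temporal-only pooling described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names <code>max_pool_time</code> and <code>avg_pool_time</code>, the pooling stride of 2, the channel count, and the (channels, frequency, time) layout are all assumptions made for the example. It only shows why pooling over time leaves the frequency axis intact and why a final average over time gives a fixed-size output for clips of different lengths.

```python
import numpy as np

def max_pool_time(x, stride=2):
    """Max-pool a (channels, freq, time) array along the time axis only."""
    c, f, t = x.shape
    t_trim = (t // stride) * stride                 # drop leftover frames
    x = x[:, :, :t_trim].reshape(c, f, t_trim // stride, stride)
    return x.max(axis=-1)                           # frequency axis is untouched

def avg_pool_time(x):
    """Average over the whole time axis, giving a length-independent output."""
    return x.mean(axis=-1)

rng = np.random.default_rng(0)
short_clip = rng.standard_normal((8, 257, 300))     # hypothetical shorter input
long_clip = rng.standard_normal((8, 257, 598))      # hypothetical longer input

pooled_short = avg_pool_time(max_pool_time(short_clip))
pooled_long = avg_pool_time(max_pool_time(long_clip))
print(pooled_short.shape, pooled_long.shape)        # both (8, 257)
```

Because the output size no longer depends on the number of time frames, the same fully connected layers can follow regardless of the input duration.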

'''Training'''

The AVSpeech dataset, a large-scale audio-visual dataset, is used for training. AVSpeech comprises millions of video segments from YouTube featuring over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair color, may be predicted by the model. From each video, a 224x224-pixel image of the face was passed through the VGG-Face model to compute a facial feature vector. Combined with a spectrogram of the audio, this yielded training and test sets of 1.7 million and 0.15 million entries, respectively.

The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and fed to the VGG-Face model to extract <math>v_f</math>, a 4096-dimensional facial feature vector. This provides the supervision signal for the voice encoder. The feature <math>v_s</math>, the 4096-dimensional facial feature vector output by the voice encoder, is trained to predict <math>v_f</math>.

To train this model, a suitable loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a natural choice, but in practice it leads to unstable results and long training times. Figure 2, below, shows the difference between the facial features predicted with <math>||v_f - v_s||_1</math> and with the following loss. Based on the work of Castrejon et al. [4], the loss additionally penalizes differences in the activations of the last layer of the VGG-Face model, <math>f_{VGG}</math>: <math> \mathbb{R}^{4096} \to \mathbb{R}^{2622}</math>, and of the first layer of the face decoder, <math>f_{dec}</math> : <math> \mathbb{R}^{4096} \to \mathbb{R}^{1000}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes both the Euclidean distance between the two normalized facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}(a)\text{log}p_{(i)}(b)$$ $$p_{(i)}(a) = \frac{\text{exp}(a_i/T)}{\sum_j \text{exp}(a_j/T)}$$ Knowledge distillation is used as an alternative to cross-entropy. Following the recommendation of Cole et al. [3], <math> T = 2 </math> was used to ensure a smooth activation. The weights <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that the gradient magnitudes of the individual terms with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.
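
As a rough illustration of how the three terms of the loss fit together, the following NumPy sketch evaluates <math>L_{total}</math> for a single pair of feature vectors. It is a sketch under stated assumptions, not the authors' code: <code>f_dec</code> and <code>f_vgg</code> are hypothetical stand-ins (small random linear maps rather than the real network layers), the feature dimension is shrunk to 16 for readability, and the first term is computed with an L1 norm, which is an assumption because the formula above leaves that norm unspecified.

```python
import numpy as np

def softmax_T(a, T=2.0):
    """Temperature-softened softmax p_(i)(a) from the distillation formula."""
    z = a / T
    z = z - z.max()                        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(a, b, T=2.0):
    """L_distill(a, b) = -sum_i p_i(a) * log p_i(b)."""
    return -np.sum(softmax_T(a, T) * np.log(softmax_T(b, T)))

def total_loss(v_f, v_s, f_dec, f_vgg, lam1=0.025, lam2=200.0, T=2.0):
    # Decoder-layer term; the L1 norm here is an assumption.
    dec_term = np.abs(f_dec(v_f) - f_dec(v_s)).sum()
    # Squared Euclidean distance between the unit-normalized features.
    unit_f = v_f / np.linalg.norm(v_f)
    unit_s = v_s / np.linalg.norm(v_s)
    norm_term = np.sum((unit_f - unit_s) ** 2)
    # Knowledge distillation on the (stand-in) VGG-Face last-layer outputs.
    kd_term = distill_loss(f_vgg(v_f), f_vgg(v_s), T)
    return dec_term + lam1 * norm_term + lam2 * kd_term

# Hypothetical stand-ins for the network layers: random linear maps.
rng = np.random.default_rng(0)
W_dec = rng.standard_normal((10, 16))
W_vgg = rng.standard_normal((12, 16))
f_dec = lambda v: W_dec @ v
f_vgg = lambda v: W_vgg @ v

v_f = rng.standard_normal(16)              # "true" face feature
v_s = rng.standard_normal(16)              # feature predicted from speech
loss_diff = total_loss(v_f, v_s, f_dec, f_vgg)
loss_same = total_loss(v_f, v_f, f_dec, f_vgg)
print(loss_diff > loss_same)
```

Note that even a perfect prediction (<math>v_s = v_f</math>) does not give zero loss: the distillation term then reduces to the entropy of the softened distribution, which is its minimum over <math>v_s</math>.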

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

<div style="text-align:center;"> Figure 2: '''Qualitative results on the AVSpeech test set''' </div>

'''Implementation Details'''

Six seconds of audio were used to compute the spectrogram by taking a short-time Fourier transform (STFT) with a Hann window of 25 ms, a hop length of 10 ms, and 512 FFT frequency bands. A CNN-based face detector from Dlib was used to crop the face regions from the frames. The VGG-Face features were computed from the resized faces and, together with the spectrograms, were used for training. In total, there were 1.7 million and 0.15 million spectrogram-feature pairs for training and testing, respectively.
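
The spectrogram computation can be sketched in plain NumPy as follows. This is a minimal illustration rather than the authors' pipeline: the 16 kHz sampling rate is an assumption (this summary does not state it), the function name <code>stft_spectrogram</code> is hypothetical, no padding is applied at the clip boundaries, and random noise stands in for real speech.

```python
import numpy as np

def stft_spectrogram(audio, sample_rate=16000, win_ms=25, hop_ms=10, n_fft=512):
    """Complex STFT with a Hann window; returns (freq_bins, frames)."""
    win = int(sample_rate * win_ms / 1000)      # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)      # 160 samples at 16 kHz
    window = np.hanning(win)                    # symmetric Hann window
    n_frames = 1 + (len(audio) - win) // hop
    frames = np.stack([audio[i * hop: i * hop + win] * window
                       for i in range(n_frames)])
    # Each 400-sample frame is zero-padded to n_fft before the FFT.
    spec = np.fft.rfft(frames, n=n_fft, axis=1)
    return spec.T

rng = np.random.default_rng(0)
audio = rng.standard_normal(6 * 16000)          # 6 s of placeholder "audio"
spec = stft_spectrogram(audio)
print(spec.shape)                               # (257, 598)
```

With these settings a 6-second clip gives 257 frequency bins (<code>n_fft // 2 + 1</code>) by 598 time frames; a 3-second clip would give roughly half as many frames, which is why the encoder must handle variable-length inputs.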

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

<div style="text-align:center;"> Figure 3. '''Facial attribute evaluation''' </div>

To determine the similarity between the generated images and the ground truth, Face++, a commercial service that classifies faces by distinct attributes (such as gender, ethnicity, etc.), was used. Figure 3 gives confusion matrices for gender, ethnicity, and age. These matrices show that the Speech2Face model performs very well on gender, misclassifying only 6% of the time. Similarly, the model performs fairly well on ethnicity, especially for white or Asian faces. The model performs worse on black and Indian faces, which can be attributed to the heavily unbalanced data: 50% of the data represented a white face, and 80% represented a white or Asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

<div style="text-align:center;"> Table 2. '''Feature similarity''' </div>

The results were also examined in terms of the similarity of the features predicted by the Speech2Face model. The cosine, L1, and L2 distances between the facial feature vector produced by the model and the true facial feature vector were computed and are presented, above, in Table 2. Facial similarity was also compared across lengths of audio input. From the table, it is evident that the 6-second audio produced lower cosine, L1, and L2 distances, resulting in a facial feature vector that is closer to the ground truth.
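
For concreteness, the three distances can be computed as in the sketch below. The helper name <code>feature_distances</code> is hypothetical, and taking the cosine distance to be one minus the cosine similarity is an assumption about the convention used; two-dimensional toy vectors replace the 4096-D features so the numbers can be checked by hand.

```python
import numpy as np

def feature_distances(v_s, v_f):
    """Cosine, L1 and L2 distances between predicted and true features."""
    cos_sim = np.dot(v_s, v_f) / (np.linalg.norm(v_s) * np.linalg.norm(v_f))
    return {
        "cosine": 1.0 - cos_sim,             # assumed convention: 1 - similarity
        "L1": np.abs(v_s - v_f).sum(),
        "L2": np.linalg.norm(v_s - v_f),
    }

v_f = np.array([3.0, 4.0])                   # toy "true" feature
v_s = np.array([4.0, 3.0])                   # toy "predicted" feature
dists = feature_distances(v_s, v_f)
print(dists)
```

For these toy vectors the cosine similarity is 24/25, so the cosine distance is 0.04, the L1 distance is 2, and the L2 distance is the square root of 2.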

'''S2F -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

<div style="text-align:center;"> Table 3. '''S2F -> Face retrieval performance''' </div>

The model was also examined on how well its output could be used to retrieve the original image. The R@K metric, also known as retrieval performance by recall at K, measures the probability that the K closest images to the model output include the correct image of the speaker's face. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.
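
The R@K computation can be sketched as follows. This is an illustrative implementation under assumptions: retrieval is done by L2 distance in feature space (the summary does not state which distance is used), query <code>i</code> is taken to match database entry <code>i</code>, and the function name <code>recall_at_k</code> and the synthetic data are hypothetical.

```python
import numpy as np

def recall_at_k(queries, database, k):
    """Fraction of queries whose true match (same row index) is in the top k."""
    # Pairwise L2 distances, shape (n_queries, n_database).
    d = np.linalg.norm(queries[:, None, :] - database[None, :, :], axis=-1)
    top_k = np.argsort(d, axis=1)[:, :k]             # k nearest database rows
    hits = (top_k == np.arange(len(queries))[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
database = rng.standard_normal((50, 16))             # stand-in "true" face features
queries = database + 1.5 * rng.standard_normal((50, 16))  # noisy "predictions"

r1 = recall_at_k(queries, database, 1)
r5 = recall_at_k(queries, database, 5)
r_all = recall_at_k(queries, database, 50)
print(r1, r5, r_all)
```

For a database of N faces, random guessing gives R@K of about K/N, which is the baseline against which the scores in Table 3 are compared.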

'''Additional Observations'''

Ablation studies were carried out to test the effect of audio duration and batch normalization. It was found that the duration of input audio during the training stage had little effect on convergence speed (comparing 3 and 6-second speech segments), while in the test stage longer input speech yields improvement in reconstruction quality. With respect to batch normalization (BN), it was found that without BN reconstructed faces would converge to an average face, while the inclusion of BN led to results which contained much richer facial features.

== Conclusion ==
The paper presented a novel study of face reconstruction from audio recordings of a person speaking. The model was shown to predict plausible face reconstructions with facial features similar to those in real images of the person speaking. The problem was addressed by learning to align the feature space of speech with that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. It was then evaluated by comparing the attributes of the reconstructed faces and of the true faces, as judged by a commercial face attribute classification service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation than predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

There is evidence that the results of the model may be heavily influenced by external factors:

1. Their method of sampling random YouTube videos resulted in a sample that is unbalanced in terms of ethnicity. Over half of the samples were white. We also saw a large bias towards white in the model's prediction of ethnicity. This bias suggests that the model may be overfitting the training data, and raises the question of how the model would perform when trained and tested on a balanced dataset. Figure (11) highlights this shortcoming: the same man, heard speaking in either English or Chinese, was predicted to have a "white" or an "Asian" appearance, respectively.

2. The model was shown to infer different face features based on language. This raises the question of how heavily the model depends on the spoken language. The paper mentions that the quality of face reconstruction may be affected by uncommon languages, since English is the most popular language on YouTube (the source of the training set). Testing on a more controlled sample, where all speech recordings are in the same language, may help determine the model's reliance on spoken language.

3. The evaluation of the result is also highly dependent on the Face++ classifiers. Since the authors evaluate their model by running the Face++ classifiers on the original images and the reconstructions and comparing age, gender, and ethnicity, the measured performance can only be as reliable as the classifier used to evaluate it. Therefore, any limitations of the Face++ classifier may become limitations of Speech2Face and may have a compounding effect on the misclassification rate.

4. Figure 4.b shows the AVSpeech dataset statistics. However, it doesn't show statistics about the speakers' ethnicity or the language of the video. If we train the model with a more comprehensive dataset that includes enough Asian/Indian English speakers and native-language speakers, will this increase the accuracy?

5. One concern about the source of the training data, i.e. the YouTube videos, is that resolution varies a lot since the videos are randomly selected. That may be the reason why the proposed model performs badly on certain features. For example, it is hard to tell a person's age when the resolution is poor, because wrinkles on the face are lost.

6. The topic of this project is very interesting, but I highly doubt this model will be practical for real-world problems, because many factors affect a person's voice recording in a real-world environment: background sounds such as a phone clock, a TV, a car horn, and so on. These sounds will decrease the accuracy of the model's predictions.

7. A lot of information can be obtained from someone's voice, which could potentially be useful for detective work and crime scene investigation. In our world of increasing surveillance, public voice recording is quite common, and images of potential suspects could be reconstructed from their voices. For this to be achieved, the model has to be thoroughly trained and tested to avoid false positives, as these could have a highly destructive outcome for a falsely convicted suspect.

8. This is a very interesting topic, and this summary has a good structure for readers. The model is trained on YouTube videos, but one problem is that most YouTubers are adults, and many additional factors make this dataset highly unbalanced. What is more, some people may have a baby voice, which could also affect the performance of the model. Overall, though, this is a meaningful topic that might help police locate suspects, so it might be interesting to apply it to police work.

9. In addition, it seems very unlikely that any results coming from this model would ever be held in regard even remotely close to being admissible in court to identify a person of interest until the results are improved and the model can be shown to work in real-world applications. Otherwise, there seems to be very little use for such technology and it could have negative impacts on people if they were to be depicted in an unflattering way by the model based on their voice.

10. Using voice as a factor of constructing the face is a good idea, but it seems like the data they have will have lots of noise and bias. The voice of a video might not come from the person in the video. There are so many YouTubers adjusting their voices before uploading their video and it's really hard to know whether they adjust their voice. Also, most YouTubers are adults so the model cannot have enough training samples about teenagers and kids.

11. It would be interesting to see how the performance changes with different face encoding sizes (instead of just 4096-D) and with different face models (encoders/decoders), to see if better performance can be achieved. Also, given that the dataset used was unbalanced, was the face model trained on the same dataset, or on a different one (the model was pretrained)? This could affect the performance of the model as well.

12. The audio input is transformed into a spectrogram before being used for training. They use an STFT with a Hann window of 25 ms, a hop length of 10 ms, and 512 FFT frequency bands. They cite this method from a paper that focuses on speech separation, not speech classification. So, it would be interesting to see if there is a better way to do the STFT, possibly with different hyperparameters (e.g. different windowing, a different number of bands), or if another type of transform (e.g. a wavelet transform) would give better results.

13. An easy way to get somewhat balanced data is to duplicate samples from the underrepresented classes.

14. This problem is interesting but is hard to generalize. The algorithm does not account for other genders or for mixed-race individuals. In addition, the face recognition software Face++ introduces bias which can carry forward to the Speech2Face algorithm. Face recognition algorithms are known to have higher error rates when classifying darker-skinned individuals. Thus, it will be tough to apply it to real-life scenarios like identifying suspects.

15. This experiment raises a lot of ethical complications when it comes to possible applications in the real world. Even if this model were highly accurate, the ability to discern a person's racial ethnicity, skin tone, etc. based solely on their voice could play into inherent biases of the application user, and this may end up being an issue that needs to be combatted in future research in this area. Another possible issue is that many people change their intonation or vocal features based on the context (for example, I'll likely have a different voice pattern, in terms of projection, intonation, etc., in a job interview than when casually chatting/mumbling with a friend while playing video games).

16. Overall, a very interesting topic. I want to talk about the technical challenges raised by using the AVSpeech dataset for training. The paper acknowledges that AVSpeech is unbalanced, and that 80% of the data are white and Asian faces. It also says in the results section that "Our model does not perform on other races due to the imbalance in data". There does not seem to be any effort made to balance the data. I think that there are definitely some data processing techniques that could be used (filtering, data augmentation, etc.) to address the class imbalance problem. Not seeing any of these in the paper is a bit disappointing. Another issue I have noticed is that, due to ethical considerations, the model aims to predict an average-looking face of a certain gender/racial group from voice input. If we cannot reveal the identity of a person, why not predict the gender and race directly? Giving an average-looking face does not seem to be the most helpful.

17. This is a very interesting research paper with an interesting main objective. The research leads to open questions that could carry over to other applications, such as predicting a person's face from their voice in more advanced ways. The only risk is how the data is obtained: it comes from YouTube, where the data is not consistent.

18. The paper uses millions of natural videos of people speaking to find the correlation between face and voice. Since face and voice are commonly used to identify a person, there are many possible research opportunities and applications in improving voice and face unlock.

19. It would be better to have a future work section that discusses the current shortcomings and explores possible improvements and applications.

20. While the idea behind Speech2Face is interesting, ethnic profiling is a huge concern and it can further lead to racial discrimination, racism etc. Developers must put more care and thought into applying Speech2Face in tech before deploying the products.

21. It would be helpful if the authors could explore the different applications of this project in real life. Speech2Face could be helpful during criminal investigations and, more generally, in scenarios where someone's picture is missing and only their voice is available. It would also be helpful if the authors could state the importance of, and need for, this kind of project in society.

22. The authors mention that they use the AVSpeech dataset for both training and testing but do not talk about how they split the data. It is possible that the same speakers were used in the training and testing data and so the model is able to recreate a face simply by matching the observed face to the observed audio. This would explain the striking example images shown in the paper.

23. Another interesting application of this research is automated speech or facial animation at scale or in multiple languages. The cutting-edge automated facial animation solution provided by JALI Research Inc is applied in Cyberpunk 2077.

24. It would be interesting to know whether the model can predict a similar face when one person speaks different languages. A person who speaks multiple languages can have different tones and accents depending on the language they are speaking.

25. The results are actually amazing for the introduction of Speech2Face. As others have mentioned, the researchers might have used a biased dataset of YouTube videos favoring certain ethnicities and their accents and dialects. Thus, it would be nice to also see the data distribution. Additionally it would be nice to see how their model reacts to people who are able to speak multiple languages and see how well Speech2Face generalizes different language pronunciations of one person.

26. The paper introduces Speech2Face, which is definitely one of the major areas of future research. In the paper, the confusion matrix indicates that the model tends to misclassify the age of the speaker, specifically in the 40-70 range. It would be interesting to see if the model could overcome this bottleneck by training on more speech from the 40-70 age group.

27. An interesting topic which, as others have mentioned, has many ethical considerations and implications. Particularly in regions where call recording is permitted, there is dangerous potential for the technology to be misused to identify and target individuals. It would also be interesting to get a more in-depth exploration of how the language spoken and the speaker's accent introduce bias. For example, if a person speaks with a strong British accent, are they classified as white? Spanish speakers, in particular, vary greatly with respect to their skin colour and features; how well does the algorithm work on these individuals? A last nit-pick is the labelling used (i.e. Asian, White, Indian, Black), which is not accurate, since Indians, and moreover South Asians, fall under Asian as well.

28. This topic is quite interesting and could make a great contribution to fighting crime. For that purpose, however, accuracy is essential. There is still much room for improvement, since inferring a person's face from his/her voice is hard: there are many factors involved, such as oral structure, the language environment, and even personality. These unpredictable factors could introduce considerable bias.

29. This is an interesting topic and could be of great use in finding criminals or other people whose voices have been recorded. However, the voice recording might be noisy, and some recordings might include the voices of multiple people. The authors could consider ways to eliminate those factors that might affect the accuracy of the face generation.

30. Most contents described in the paper are very useful. However, YouTube might not be a good enough data source since there are fewer labels to classify. Perhaps, after generating the model, the transfer learning could be done based on Facebook's videos in order to solve the imbalanced problem.

31. This topic is incredibly interesting and the writers should commend themselves on a job well done. However, YouTube is not only an ethnically skewed dataset, but also has a non-negligible number of creators who use voice modifiers, auto-tune, or other tools to change the pitch of their voices, which may lead to significantly more errors in practical applications. A better dataset could be Skype video calls or a classroom study. Also, judging from the way the model makes its predictions, it seems very prone to overfitting on the dataset and will not generalize well, since pitch and sound are both incredibly variable across humans.

32. One thing to notice is that the training data is downloaded from YouTube, which may be a good site for retrieving a large amount of data. However, it allows for the possibility that the voices retrieved do not match the people who appear in the videos. If that is the case, those records become dirty data and need to be cleaned before training the model. Otherwise, there will be some large misclassifications because some of the training data does not make sense. One way I can think of to mitigate this problem is to train multiple models on different subsets of the original dataset and combine the results of all the models by taking a weighted average.

33. Predicting appearance with sound is a very imaginative research direction. But the author did not explain how to exclude environmental factors in data preprocessing, such as light intensity, facial dress, facial wounds, etc. In the training data set, different sound and image resolutions also affect the effectiveness of the model. The author needs more robustness tests to exclude these factors.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision (ICCV), 2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giro-i-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.

[7] “Overview of GAN Structure | Generative Adversarial Networks,” ''Google Developers'', 24-May-2019. [Online]. Available: https://developers.google.com/machine-learning/gan/gan_structure. [Accessed: 02-Dec-2020].

A universal SNP and small-indel variant caller using deep neural networks

2020-12-07T09:58:38Z

Y87yu: /* Critique and Discussion */

== Background ==

Genes determine biological functions, and mutants or alleles (one of two or more alternative forms of a gene that arise by mutation and are found at the same place on a chromosome) of those genes determine differences within a function. Determining novel alleles is very important in understanding the genetic variation within a species. For example, different alleles of the gene OCA2 influence eye colour. All animals receive one copy of each gene from each of their parents. Mutations of a gene are classified as either homozygous (both copies are the same) or heterozygous (the two copies are different).

Next-generation sequencing (NGS) is a prevalent technique for sequencing, or reading, DNA. Since all genes are encoded as DNA, sequencing is an essential tool for understanding genes. Next-generation sequencing works by reading short sections of DNA of length k, called k-mers, and then piecing them together or aligning them to a reference genome. Next-generation sequencing is relatively fast and inexpensive, although it can randomly misidentify some nucleotides, introducing errors. These errors arise from a complex process that depends on various factors.

Variant calling is the process of determining novel alleles from sequencing data (typically next-generation sequencing data). Some significant alleles differ from the "standard" version of a gene by only a single base pair, such as the mutation which causes multiple sclerosis. Therefore, it is crucial to accurately call single nucleotide swaps/polymorphisms (SNPs), insertions, and deletions (indels). Calling SNPs and small indels is technically challenging, since it requires a program to distinguish between genuinely novel mutations and errors in the sequencing data.

Previous approaches usually involved using various statistical techniques. A widely used one is GATK. GATK uses a combination of logistic regression, hidden Markov models, naive Bayes classification, and Gaussian mixture models to perform the process [2]. However, these methods have their weaknesses as some assumptions do not hold (i.e., independence assumptions). In addition, given that GATK testing has focused primarily on human whole-genome data sequenced using Illumina technology, it is not easily generalizable to different types of data, organisms, and experimental designs/sequencing technologies [[https://gatk.broadinstitute.org/hc/en-us/articles/360035894711-About-the-GATK-Best-Practices 3]].

This paper aims to solve the problem of calling SNPs and small indels using a convolutional neural network, by casting the reads as images and classifying whether they contain a mutation. It introduces a variant caller called "DeepVariant", which requires no specialized knowledge but performs better than previous state-of-the-art methods.

== Overview ==

Figure 1 gives an overview of the DeepVariant workflow.

[[File:figure 111.JPG|Figure 1. In all panels, blue boxes represent data and red boxes are processes]]

Initially, the NGS reads are aligned to a reference genome and then scanned for candidate variants, i.e., sites that differ from the reference genome. The read and reference data are encoded as an image for each candidate variant site. Then, the trained CNN computes the genotype likelihoods for each of the candidate variants (Figure 1, left box).
To train the CNN for image classification purposes, the DeepVariant machinery makes pileup images for a labeled sample with known genotypes. These labeled images and known genotypes are provided to the CNN for training, and a stochastic gradient descent algorithm is used to optimize the CNN parameters to maximize genotype prediction accuracy. After the model converges, the final model is frozen and used to call mutations in other samples (Figure 1, middle box).
For example, in Figure 1 (right box), the reference and read bases are encoded into a pileup image at a candidate variant site. Using this encoded image, the CNN computes the genotype likelihoods for the three diploid genotype states: homozygous reference (hom-ref), heterozygous (het), or homozygous alternate (hom-alt). In this example, a heterozygous variant call is emitted, as the most probable genotype here is “het”.
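
The final calling step described above amounts to picking the most probable of the three genotype states. A minimal sketch in Python, where the function name and the example likelihood values are illustrative assumptions and not taken from the paper:

```python
# The three diploid genotype states named in the summary.
GENOTYPES = ("hom-ref", "het", "hom-alt")

def call_genotype(likelihoods):
    """Return the genotype with the highest likelihood.

    `likelihoods` is a sequence of three numbers in the order of GENOTYPES.
    This is a hypothetical helper, not DeepVariant's actual interface.
    """
    if len(likelihoods) != len(GENOTYPES):
        raise ValueError("expected one likelihood per genotype state")
    best = max(range(len(GENOTYPES)), key=lambda i: likelihoods[i])
    return GENOTYPES[best]

# Made-up likelihoods for illustration only.
example_call = call_genotype([0.02, 0.95, 0.03])
```

With these made-up likelihoods the heterozygous state has the largest value, so a "het" call is emitted.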

== Preprocessing ==

Before the sequencing reads can be fed into the classifier, they must be pre-processed. There are many pre-processing steps that are necessary for this algorithm. These steps represent the real novelty in this technique by transforming the data to allow us to use more common neural network architectures for classification. The pre-processing of the data can be broken into three phases: the realignment of reads, finding candidate variants and creating the candidate variants' images.

The read realignment phase is essential to ensure the sequences can be adequately compared to the reference sequence. First, the reads are aligned to a reference sequence. Reads that align poorly are grouped with other reads around them to build that section, or haplotype, from scratch. If there is strong evidence that the new version of the haplotype fits the reads well, the reads are re-aligned. This process updates each read's CIGAR (Compact Idiosyncratic Gapped Alignment Report) string, which represents the read's alignment to the reference.
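
To make the role of the CIGAR string concrete, the following sketch parses one into (length, operation) pairs and computes how many reference and read bases the alignment covers. The operation letters follow the SAM format; the function names are our own and the example string is made up.

```python
import re

# CIGAR operations that consume reference bases or read bases (SAM format).
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_READ = set("MIS=X")

def parse_cigar(cigar):
    """Split a CIGAR string such as '3M1I2M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)

def read_length(cigar):
    """Number of read bases described by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)

# Hypothetical read: 3 matches, 1 inserted base, 2 matches,
# a 2-base deletion, then 4 matches.
ops = parse_cigar("3M1I2M2D4M")
```

An insertion (I) adds read bases without advancing along the reference, while a deletion (D) does the opposite, which is why decoding the CIGAR string is enough to recover the allele a read supports at a given site.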

Once the reads are correctly aligned, the algorithm then proceeds to find candidate variants, regions in the DNA sequence containing variants. It is these candidate variants that will eventually be passed as input to the neural network. To find these, we need to consider each position in the reference sequence independently. Any unusable reads are filtered at this point. This includes reads that are not appropriately aligned, are marked as duplicates, fail vendor quality checks, or have a mapping quality of less than ten. For each site in the genome, we collect all the remaining reads that overlap that site. The corresponding allele aligned to that site is then determined by decoding each read's CIGAR string, which was updated during the realignment phase. The alleles are then classified into one of four categories: reference-matching base, reference-mismatching base, insertion with a specific sequence, or deletion with a specific length, and the number of occurrences of each distinct allele across all reads is counted. Read bases are only included as potential alleles if each base in the allele has a quality score of at least 10.
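
The filtering and counting at a single site can be sketched as follows. The thresholds of 10 come from the description above; the `ReadAtSite` record, its field names, and the allele string encoding are assumptions made for illustration and are not DeepVariant's data structures.

```python
from collections import Counter
from dataclasses import dataclass

MIN_MAPPING_QUALITY = 10  # reads below this are discarded
MIN_BASE_QUALITY = 10     # every base of an allele must reach this

@dataclass
class ReadAtSite:
    """Hypothetical summary of one read overlapping the site of interest."""
    allele: str            # e.g. "A", "+ACG" (insertion), "-2" (deletion)
    base_qualities: list   # quality of each base in the allele
    mapping_quality: int
    is_duplicate: bool = False
    fails_vendor_qc: bool = False

def is_usable(read):
    """Apply the read-level filters described in the text."""
    return (not read.is_duplicate
            and not read.fails_vendor_qc
            and read.mapping_quality >= MIN_MAPPING_QUALITY)

def count_alleles(reads):
    """Count each distinct allele among usable, high-quality reads."""
    counts = Counter()
    for read in reads:
        if not is_usable(read):
            continue
        if all(q >= MIN_BASE_QUALITY for q in read.base_qualities):
            counts[read.allele] += 1
    return counts

# Made-up reads for illustration.
reads = [
    ReadAtSite("A", [30], 60),
    ReadAtSite("G", [35], 60),
    ReadAtSite("G", [5], 60),                      # base quality too low
    ReadAtSite("G", [30], 5),                      # mapping quality too low
    ReadAtSite("G", [30], 60, is_duplicate=True),  # duplicate
    ReadAtSite("A", [20], 10),
]
counts = count_alleles(reads)
```

Only three of the six made-up reads survive the filters, so the site ends up with two observations of "A" and one of "G".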

The last phase of pre-processing is to convert these candidate variants into images representing the data with candidate variants identified. This allows for the use of well established convolutional neural networks for image classification for this technical problem. Each color channel is used to store a different piece of information about a candidate variant. The red channel encodes which base we have (A, G, C, or T) by mapping each base to a particular value. The quality of the read is mapped to the green color channel.

Moreover, the blue channel encodes whether or not the reference is on the positive strand of the DNA. Each row of the image represents a read, and each column represents a particular base in that read. The reference strand is repeated for the first five rows of the encoded image to maintain its information after a 5x5 convolution is applied. With the data pre-processing complete, the images can then be passed into the neural network for classification.
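
The image construction can be sketched as below. Only the overall layout (five reference rows, one row per read, base/quality/strand in the red/green/blue channels) follows the description above; the specific channel values, the quality cap of 40, and the assumption that every read spans the full window are placeholders chosen for illustration.

```python
# Hypothetical channel values; the paper's actual values are different.
BASE_TO_RED = {"A": 250, "G": 180, "T": 100, "C": 30}
MAX_QUALITY = 40          # assumed cap used to scale quality into 0..255
REFERENCE_ROWS = 5        # the reference is repeated in the first five rows

def encode_pixel(base, quality, on_positive_strand):
    """Map one base to an (R, G, B) tuple."""
    red = BASE_TO_RED.get(base, 0)
    green = int(255 * min(quality, MAX_QUALITY) / MAX_QUALITY)
    blue = 255 if on_positive_strand else 0
    return (red, green, blue)

def build_pileup(reference, reads):
    """Build a pileup image as a list of rows of (R, G, B) tuples.

    `reads` is a list of (bases, qualities, on_positive_strand) triples,
    each assumed to be aligned to, and as long as, `reference`.
    """
    ref_row = [encode_pixel(b, MAX_QUALITY, True) for b in reference]
    image = [list(ref_row) for _ in range(REFERENCE_ROWS)]
    for bases, qualities, strand in reads:
        image.append([encode_pixel(b, q, strand)
                      for b, q in zip(bases, qualities)])
    return image

# Made-up window of four bases with two reads.
image = build_pileup("ACGT", [
    ("ACGT", [40, 40, 20, 0], True),
    ("AGGT", [10, 10, 10, 10], False),
])
```

The resulting image has seven rows (five reference rows plus two reads) and four columns; a real pipeline would then rescale it to the network's input size.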

== Neural Network ==

The neural network used is a convolutional neural network (CNN). Although the full network architecture is not revealed in the paper, there are several details which we can discuss. The architecture of the network is an input layer attached to an adapted Inception v2 ImageNet model with nine partitions. The Inception v2 model in particular uses a series of CNNs. One interesting aspect of the Inception model is that, rather than tuning hyperparameters such as the filter size for each layer, it concatenates the outputs of filters of several different sizes on the same layer, which lets the network learn which of these filters are most useful. The input layer takes as input the images representing the candidate variants and rescales them to 299x299 pixels. The output layer is a three-class softmax layer initialized with Gaussian random weights with a standard deviation of 0.001. This final layer is fully connected to the previous layer. The three classes are homozygous reference (meaning it is not a variant), heterozygous variant, and homozygous variant. The candidate variant is classified into the class with the highest probability.

The model is trained using stochastic gradient descent with a weight decay of 0.00004. Training was done in mini-batches of 32 images, using a root mean squared (RMS) decay of 0.9. For the multiple sequencing technologies experiments, a single model was trained with a learning rate of 0.0015 and momentum 0.8 for 250,000 update steps. For all other experiments, multiple models were trained, and the one with the highest accuracy on the training set was chosen as the final model. The multiple models stem from using each combination of the possible parameter values for the learning rate (0.00095, 0.001, 0.0015) and momentum (0.8, 0.85, 0.9). These models were trained for 80 hours, or until the training accuracy converged.
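
The model-selection procedure above can be sketched as a small grid search. The hyperparameter values are the ones quoted in this summary; the configuration dictionary keys and the `training_accuracy` callable are hypothetical stand-ins for an actual training run.

```python
from itertools import product

LEARNING_RATES = (0.00095, 0.001, 0.0015)
MOMENTA = (0.8, 0.85, 0.9)

def hyperparameter_grid():
    """Every learning-rate/momentum combination, plus the fixed settings."""
    return [
        {
            "learning_rate": lr,
            "momentum": momentum,
            "weight_decay": 0.00004,
            "batch_size": 32,
            "rms_decay": 0.9,
        }
        for lr, momentum in product(LEARNING_RATES, MOMENTA)
    ]

def select_best(configs, training_accuracy):
    """Pick the configuration whose trained model scores highest.

    `training_accuracy` is a hypothetical callable that trains a model
    with the given configuration and returns its training-set accuracy.
    """
    return max(configs, key=training_accuracy)

grid = hyperparameter_grid()
```

Three learning rates and three momentum values give nine candidate models in total. Note that selecting on training-set accuracy, as described, does not guard against overfitting; a held-out validation set would be the more usual choice.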

== Results ==

DeepVariant was trained using data available from the CEPH (Centre d'Étude du Polymorphisme Humain) female sample NA12878 and was evaluated on the unseen Ashkenazi male sample NA24385. The results were compared with other commonly used bioinformatics methods, such as GATK, FreeBayes, SAMtools, 16GT and Strelka (Table 1). For better comparison, the overall accuracy (F1), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are reported over the whole genome.

[[File:table 11.JPG]]

DeepVariant showed the highest accuracy and more than 50% fewer errors per genome compared to the next best algorithm.
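
For readers unfamiliar with the metrics in Table 1, the sketch below shows how precision, recall and F1 follow from the TP, FP and FN counts. The counts used in the example are made up and are not values from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

Here precision is 90/100, recall is 90/120, and F1 is their harmonic mean, which equals 2·TP / (2·TP + FP + FN).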

The authors also evaluated the same set of algorithms using the synthetic diploid sample CHM1-CHM13 (Table 2).

[[File:Table 333.JPG]]

The results show that DeepVariant outperformed all other algorithms for variant calling (SNPs and indels) and achieved the highest F1, recall, precision and number of true positives.

== Conclusion ==

This endeavor to further advance a data-centric approach to understanding the gene sequence illustrates the advantages of deep learning over manual analysis. With billions of DNA base pairs, no human can digest that amount of genetic data. In the past, such computational techniques were infeasible due to the lack of computing power, but in the 21st century, it seems that machine learning is the way to go for molecular biology.

DeepVariant’s strong performance on human data shows that deep learning is a promising technique for variant calling. Perhaps the most exciting feature of DeepVariant is its simplicity. Unlike other state-of-the-art variant callers, DeepVariant has no built-in knowledge of the sequencing technologies that create the reads or even the biological processes that introduce mutations. It simplifies the problem of variant calling to preprocessing the reads and training a generic deep learning model. This also suggests that DeepVariant could be significantly improved by tailoring the preprocessing to specific sequencing technologies and developing a dedicated CNN architecture for the reads, rather than casting them as images.

== Critique and Discussion==

The paper presents an attractive method for solving a significant problem. Building "images" of reads and running them through a generic image classification CNN seems like a strange approach, and, interestingly, it works well. The most significant issue with the paper is the lack of specific information about how the methods work. Some extra information is included in the supplementary material, but there are still some significant gaps. In particular:

1. What is the structure of the neural net? How many layers, and what sizes? The paper for ConvNet does not have this information. We suspect that this might be a trade secret that Google is protecting.

2. How is the realignment step implemented? The paper mentions that it uses a "De-Bruijn-graph-based read assembly procedure" to realign reads to a new haplotype. It is a non-standard step in most genomics workflows, yet the paper does not describe how they do the realignment or build the haplotypes.

3. How did they settle on the image construction algorithm? The authors provide pseudocode for the construction of pileup images, but they do not explain how the design decisions were made. For instance, the color values for different base pairs are not evenly spaced. Also, the image begins with five rows of the reference genome.

One thing we appreciated about the paper was their commentary on future developments. The authors clarify that this approach can be improved on and provide specific ideas for the next steps.

Overall, the paper presents an interesting idea with strong results but lacks detail in some vital implementation pieces.

4. The topic of this project is good, but we need more details on the algorithm. In the neural network part, the details are not enough; the authors should provide a figure to better explain the model's structure and how it works. Otherwise, we cannot understand how the model works. When preprocessing the data, if different data have different lengths, should information be added or dropped to make them match?

5. Particularly, which package did the researchers use to perform this analysis? Different packages of deep learning can have different accuracy and efficiency while making predictions on this data set.

Further studies on DeepVariant [https://www.nature.com/articles/s41598-018-36177-7 have shown] that it is a framework with great potential that sets the standard in the field of medical genetics.

Other good follow-up work can be seen here [https://www.semanticscholar.org/paper/A-review-of-somatic-single-nucleotide-variant-for-Xu/81b4246c90c3c8036ff719b9d4c3d5e83fbb32dc]

6. It was mentioned that part of the network used is an "adapted Inception v2 ImageNet model" - does this mean that it used an Inception v2 model that was trained on ImageNet? This is not clear, but if this is the case, then why is this useful? Why would the features that are extracted for an image be useful for genomics? Did they try using a model that was not trained? Also, they describe the preprocessing method used but were there any alternatives that they considered?

7. A more extensive discussion on the "Neural Network" section can be given. For example, the paragraph's last sentence says that "Models are trained for 80 hours, or until the training accuracy converged." This sentence implies that if the training accuracy does converge, it usually takes less than 80 hours. More exciting data can thus be presented about the training accuracy converging time. How often do the models converge? How long does it take for the models to converge on average? Moreover, why is the number 80 chosen here? Is it a random upper bound set by people to bound the runtime of the model training, or is the number 80 carefully chosen so that it is twice/three times/ten times the average training accuracy converging time?

8. It would be more convincing if the author could provide more detail on the structure of the neural network.

9. It is clear that this is a very thoroughly written paper with substantial comparison results and computation numbers to back up the testing. However, even though the implementation link is given in the paper, information regarding the structure of the model is still lacking. The paper is also structurally missing a conclusion section that summarizes the results comparison and gives a final verdict on the efficacy of the proposed model.

10. It would be interesting to see how the model would behave if we incorporated transformers into the model; this has been one reason AlphaFold 2 was so successful.

11. The illustration of the results is hard to interpret; it would be nicer if explanations were added. Details on the neural networks used are lacking. How were the mathematical calculations for prediction done?

12. It would be better if the author could provide a list of other potential networks that could be used to address the problem and a comparison between them.

13. It is very interesting to see that a part of the data preprocessing is to create images from the data as one would normally expect you to simply feed the data itself into the network. By converting to image data does this help the network classify it better or would simply feeding forward the unconverted data be better? This is interesting as at the end of the day an image is still just numerical data so would one representation hold more value over the other?

14. The authors opened the door to machine learning solutions in molecular biology by solving genetic problems through data-driven methods. However, the details of the neural network training parameters still need to be clarified. In addition, the prediction accuracy results deserve further explanation: under what environment they were obtained, and how they compare with human predictions.

== References ==
[1] Hartwell, L.H. ''et al.'' ''Genetics: From Genes to Genomes''. (McGraw-Hill Ryerson, 2014).

[2] Poplin, R. ''et al.'' A universal SNP and small-indel variant caller using deep neural networks. ''Nature Biotechnology'' '''36''', 983-987 (2018).

A universal SNP and small-indel variant caller using deep neural networks

2020-12-07T09:58:12Z

Y87yu: /* Critique and Discussion */

== Background ==

Genes determine biological functions, and mutants or alleles (one of two or more alternative forms of a gene that arise by mutation and are found at the same place on a chromosome) of those genes determine differences within a function. Determining novel alleles is very important in understanding the genetic variation within a species. For example, different alleles of the gene OCA2 determine eye colour. All animals receive one copy of each gene from each of their parents. Mutations of a gene are classified as either homozygous (both copies are the same) or heterozygous (the two copies are different).

Next-generation sequencing (NGS) is a prevalent technique for sequencing, or reading, DNA. Since all genes are encoded as DNA, sequencing is an essential tool for understanding genes. Next-generation sequencing works by reading short sections of DNA of length k, called k-mers, and then piecing them together or aligning them to a reference genome. Next-generation sequencing is relatively fast and inexpensive, although it can randomly misidentify some nucleotides, introducing errors. These errors arise from a complex process that depends on various factors.

Variant calling is the process of determining novel alleles from sequencing data (typically next-generation sequencing data). Some significant alleles differ from the "standard" version of a gene by only a single base pair, such as the mutation which causes sickle cell anemia. Therefore, it is crucial to accurately call single nucleotide polymorphisms (SNPs) as well as small insertions and deletions (indels). Calling SNPs and small indels is technically challenging since it requires a program to distinguish between genuinely novel mutations and errors in the sequencing data.

Previous approaches usually involved using various statistical techniques. A widely used one is GATK. GATK uses a combination of logistic regression, hidden Markov models, naive Bayes classification, and Gaussian mixture models to perform the process [2]. However, these methods have their weaknesses as some assumptions do not hold (i.e., independence assumptions). In addition, given that GATK testing has focused primarily on human whole-genome data sequenced using Illumina technology, it is not easily generalizable to different types of data, organisms, and experimental designs/sequencing technologies [[https://gatk.broadinstitute.org/hc/en-us/articles/360035894711-About-the-GATK-Best-Practices 3]].

This paper aims to solve the problem of calling SNPs and small indels using a convolutional neural network by casting the reads as images and classifying whether they contain a mutation. It introduces a variant caller called "DeepVariant", which requires no specialized knowledge but performs better than previous state-of-the-art methods.

== Overview ==

In Figure 1, the DeepVariant workflow overview is illustrated.

[[File:figure 111.JPG|Figure 1. In all panels, blue boxes represent data and red boxes are processes]]

Initially, the NGS reads are aligned to a reference genome and then scanned for candidate variants, i.e., sites that differ from the reference genome. The read and reference data are encoded as an image for each candidate variant site. Then, the trained CNN computes the genotype likelihoods for each of the candidate variants (Figure 1, left box).
To train the CNN for image classification purposes, the DeepVariant machinery makes pileup images for a labeled sample with known genotypes. These labeled images and known genotypes are provided to the CNN for training, and a stochastic gradient descent algorithm is used to optimize the CNN parameters to maximize genotype prediction accuracy. After the model converges, the final model is frozen and used to call mutations in other samples (Figure 1, middle box).
For example, in Figure 1 (right box), the reference and read bases are encoded into a pileup image at a candidate variant site. Using this encoded image, the CNN computes the genotype likelihoods for the three diploid genotype states: homozygous reference (hom-ref), heterozygous (het), or homozygous alternate (hom-alt). In this example, a heterozygous variant call is emitted, as the most probable genotype here is “het”.

== Preprocessing ==

Before the sequencing reads can be fed into the classifier, they must be pre-processed. There are many pre-processing steps that are necessary for this algorithm. These steps represent the real novelty in this technique by transforming the data to allow us to use more common neural network architectures for classification. The pre-processing of the data can be broken into three phases: the realignment of reads, finding candidate variants and creating the candidate variants' images.

The read realignment phase is essential to ensure the sequences can be adequately compared to the reference sequence. First, the reads are aligned to a reference sequence. Reads that align poorly are grouped with other reads around them to build that section, or haplotype, from scratch. If there is strong evidence that the new version of the haplotype fits the reads well, the reads are re-aligned. This process updates each read's CIGAR (Compact Idiosyncratic Gapped Alignment Report) string, which represents the read's alignment to the reference.

Once the reads are correctly aligned, the algorithm then proceeds to find candidate variants, regions in the DNA sequence containing variants. It is these candidate variants that will eventually be passed as input to the neural network. To find these, we need to consider each position in the reference sequence independently. Any unusable reads are filtered at this point. This includes reads that are not appropriately aligned, are marked as duplicates, fail vendor quality checks, or have a mapping quality of less than ten. For each site in the genome, we collect all the remaining reads that overlap that site. The corresponding allele aligned to that site is then determined by decoding each read's CIGAR string, which was updated during the realignment phase. The alleles are then classified into one of four categories: reference-matching base, reference-mismatching base, insertion with a specific sequence, or deletion with a specific length, and the number of occurrences of each distinct allele across all reads is counted. Read bases are only included as potential alleles if each base in the allele has a quality score of at least 10.

The last phase of pre-processing is to convert these candidate variants into images representing the data with candidate variants identified. This allows for the use of well established convolutional neural networks for image classification for this technical problem. Each color channel is used to store a different piece of information about a candidate variant. The red channel encodes which base we have (A, G, C, or T) by mapping each base to a particular value. The quality of the read is mapped to the green color channel.

Moreover, the blue channel encodes whether or not the reference is on the positive strand of the DNA. Each row of the image represents a read, and each column represents a particular base in that read. The reference strand is repeated for the first five rows of the encoded image to maintain its information after a 5x5 convolution is applied. With the data pre-processing complete, the images can then be passed into the neural network for classification.

== Neural Network ==

The neural network used is a convolutional neural network (CNN). Although the full network architecture is not revealed in the paper, there are several details which we can discuss. The architecture of the network is an input layer attached to an adapted Inception v2 ImageNet model with nine partitions. The Inception v2 model in particular uses a series of CNNs. One interesting aspect of the Inception model is that, rather than tuning hyperparameters such as the filter size for each layer, it concatenates the outputs of filters of several different sizes on the same layer, which lets the network learn which of these filters are most useful. The input layer takes as input the images representing the candidate variants and rescales them to 299x299 pixels. The output layer is a three-class softmax layer initialized with Gaussian random weights with a standard deviation of 0.001. This final layer is fully connected to the previous layer. The three classes are homozygous reference (meaning it is not a variant), heterozygous variant, and homozygous variant. The candidate variant is classified into the class with the highest probability.

The model is trained using stochastic gradient descent with a weight decay of 0.00004. Training was done in mini-batches of 32 images, using a root mean squared (RMS) decay of 0.9. For the multiple sequencing technologies experiments, a single model was trained with a learning rate of 0.0015 and momentum 0.8 for 250,000 update steps. For all other experiments, multiple models were trained, and the one with the highest accuracy on the training set was chosen as the final model. The multiple models stem from using each combination of the possible parameter values for the learning rate (0.00095, 0.001, 0.0015) and momentum (0.8, 0.85, 0.9). These models were trained for 80 hours, or until the training accuracy converged.

== Results ==

DeepVariant was trained using data available from the CEPH (Centre d'Étude du Polymorphisme Humain) female sample NA12878 and was evaluated on the unseen Ashkenazi male sample NA24385. The results were compared with other commonly used bioinformatics methods, such as GATK, FreeBayes, SAMtools, 16GT and Strelka (Table 1). For better comparison, the overall accuracy (F1), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are reported over the whole genome.

[[File:table 11.JPG]]

DeepVariant showed the highest accuracy and more than 50% fewer errors per genome compared to the next best algorithm.

The authors also evaluated the same set of algorithms using the synthetic diploid sample CHM1-CHM13 (Table 2).

[[File:Table 333.JPG]]

The results show that DeepVariant outperformed all other algorithms for variant calling (SNPs and indels) and achieved the highest F1, recall, precision and number of true positives.

== Conclusion ==

This endeavor to further advance a data-centric approach to understanding the gene sequence illustrates the advantages of deep learning over manual analysis. With billions of DNA base pairs, no human can digest that amount of genetic data. In the past, such computational techniques were infeasible due to the lack of computing power, but in the 21st century, it seems that machine learning is the way to go for molecular biology.

DeepVariant’s strong performance on human data shows that deep learning is a promising technique for variant calling. Perhaps the most exciting feature of DeepVariant is its simplicity. Unlike other state-of-the-art variant callers, DeepVariant has no built-in knowledge of the sequencing technologies that create the reads or even the biological processes that introduce mutations. It simplifies the problem of variant calling to preprocessing the reads and training a generic deep learning model. This also suggests that DeepVariant could be significantly improved by tailoring the preprocessing to specific sequencing technologies and developing a dedicated CNN architecture for the reads, rather than casting them as images.

== Critique and Discussion==

The paper presents an attractive method for solving a significant problem. Building "images" of reads and running them through a generic image classification CNN seems like a strange approach, and, interestingly, it works well. The most significant issue with the paper is the lack of specific information about how the methods work. Some extra information is included in the supplementary material, but there are still some significant gaps. In particular:

1. What is the structure of the neural net? How many layers, and what sizes? The paper for ConvNet does not have this information. We suspect that this might be a trade secret that Google is protecting.

2. How is the realignment step implemented? The paper mentions that it uses a "De-Bruijn-graph-based read assembly procedure" to realign reads to a new haplotype. It is a non-standard step in most genomics workflows, yet the paper does not describe how they do the realignment or build the haplotypes.

3. How did they settle on the image construction algorithm? The authors provide pseudocode for the construction of pileup images, but they do not explain how the design decisions were made. For instance, the color values for different base pairs are not evenly spaced. Also, the image begins with five rows of the reference genome.

One thing we appreciated about the paper was their commentary on future developments. The authors clarify that this approach can be improved on and provide specific ideas for the next steps.

Overall, the paper presents an interesting idea with strong results but lacks detail in some vital implementation pieces.

4. The topic of this project is good, but we need more details on the algorithm. In the neural network part, the details are not enough; the authors should provide a figure to better explain the model's structure and how it works. Otherwise, we cannot understand how the model works. When preprocessing the data, if different data have different lengths, should information be added or dropped to make them match?

5. Particularly, which package did the researchers use to perform this analysis? Different packages of deep learning can have different accuracy and efficiency while making predictions on this data set.

Further studies on DeepVariant [https://www.nature.com/articles/s41598-018-36177-7 have shown] that it is a framework with great potential that sets the standard in the field of medical genetics.

Other good follow-up work can be seen here [https://www.semanticscholar.org/paper/A-review-of-somatic-single-nucleotide-variant-for-Xu/81b4246c90c3c8036ff719b9d4c3d5e83fbb32dc]

6. It was mentioned that part of the network used is an "adapted Inception v2 ImageNet model" - does this mean that it used an Inception v2 model that was trained on ImageNet? This is not clear, but if this is the case, then why is this useful? Why would the features that are extracted for an image be useful for genomics? Did they try using a model that was not trained? Also, they describe the preprocessing method used but were there any alternatives that they considered?

7. A more extensive discussion on the "Neural Network" section can be given. For example, the paragraph's last sentence says that "Models are trained for 80 hours, or until the training accuracy converged." This sentence implies that if the training accuracy does converge, it usually takes less than 80 hours. More exciting data can thus be presented about the training accuracy converging time. How often do the models converge? How long does it take for the models to converge on average? Moreover, why is the number 80 chosen here? Is it a random upper bound set by people to bound the runtime of the model training, or is the number 80 carefully chosen so that it is twice/three times/ten times the average training accuracy converging time?

8. It would be more convincing if the author could provide more detail on the structure of the neural network.

9. It is clear that this is a very thoroughly written paper with substantial comparison results and computation numbers to back up the testing. However, even though the implementation link is given in the paper, information regarding the structure of the model is still lacking. The paper is also structurally missing a conclusion section that summarizes the results comparison and gives a final verdict on the efficacy of the proposed model.

10. It would be interesting to see how the model would behave if we incorporated transformers into the model; this has been one reason AlphaFold 2 was so successful.

11. The illustration of the results is hard to interpret; it would be nicer if explanations were added. Details on the neural networks used are lacking. How were the mathematical calculations for prediction done?

12. It would be better if the author could provide a list of other potential networks that could be used to address the problem and a comparison between them.

13. It is very interesting to see that a part of the data preprocessing is to create images from the data as one would normally expect you to simply feed the data itself into the network. By converting to image data does this help the network classify it better or would simply feeding forward the unconverted data be better? This is interesting as at the end of the day an image is still just numerical data so would one representation hold more value over the other?

14. The authors opened the door to machine learning solutions in molecular biology by solving genetic problems through data-driven methods. However, the details of the neural network training parameters still need to be clarified. In addition, the prediction accuracy results deserve further explanation: under what environment they were obtained, and how they compare with human predictions.

== References ==
[1] Hartwell, L.H. ''et al.'' ''Genetics: From Genes to Genomes''. (McGraw-Hill Ryerson, 2014).

[2] Poplin, R. ''et al.'' A universal SNP and small-indel variant caller using deep neural networks. ''Nature Biotechnology'' '''36''', 983-987 (2018).

Being Bayesian about Categorical Probability

2020-12-07T09:46:42Z

Y87yu: /* Conclusion and Critiques */

== Presented By ==
Evan Li, Jason Pu, Karam Abuaisha, Nicholas Vadivelu

== Introduction ==

Since the outputs of neural networks are not probabilities, Softmax (Bridle, 1990) is a staple for neural networks performing classification -- it exponentiates each logit then normalizes by the sum, giving a distribution over the target classes. A logit is a raw output/prediction of the model which is hard for humans to interpret, thus we transform/normalize these raw values into categories or meaningful numbers for interpretability. However, networks with softmax outputs give no information about uncertainty (Blundell et al., 2015; Gal & Ghahramani, 2016), and the resulting distribution over classes is poorly calibrated (Guo et al., 2017), often giving overconfident predictions even when the classification is wrong. In addition, softmax also raises concerns about overfitting NNs due to its confident predictive behaviors (Xie et al., 2016; Pereyra et al., 2017). To achieve better generalization, more effective regularization techniques might be required.
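
The softmax transformation described above can be sketched in a few lines of plain Python. This is only an illustration, not code from the paper: the function name <code>softmax</code> and the example logits are assumptions, and subtracting the maximum logit is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Exponentiate each logit and normalize by the sum."""
    m = max(logits)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs)

# A moderately larger gap between logits already yields a near-certain output,
# which is the overconfidence issue discussed in the text.
confident = softmax([10.0, 0.0, 0.0])
print(confident)
```

The output always sums to one, but nothing in it says how much the network should be trusted.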

Bayesian Neural Networks (BNNs; MacKay, 1992) can alleviate these issues, but the resulting posteriors over the parameters are often intractable. Approximations such as variational inference (Graves, 2011; Blundell et al., 2015) and Monte Carlo Dropout (Gal & Ghahramani, 2016) can still be expensive or give poor estimates for the posteriors. This work proposes a Bayesian treatment of the output logits of the neural network, treating the targets as a categorical random variable instead of a fixed label. This technique gives a computationally cheap way of being Bayesian to get well-calibrated uncertainty estimates on neural network classifications.

== Related Work ==

Using Bayesian Neural Networks is the dominant way of applying Bayesian techniques to neural networks. A Bayesian neural network is a stochastic artificial neural network that is trained using Bayesian inference. Bayesian neural networks usually have better calibration than classical neural networks, which indicates that their predicted uncertainty is more consistent with the observed errors. They are also data-efficient and can learn from small datasets without overfitting (Jospin, Buntine, Boussaid, Laga, & Bennamoun, 2020). Many techniques have been developed to make posterior approximation more accurate and scalable; despite these, BNNs do not scale to state-of-the-art techniques or large data sets. Some more scalable techniques explicitly avoid modeling the full weight posterior, such as Monte Carlo Dropout (Gal & Ghahramani, 2016) or tracking the mean/covariance of the posterior during training (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019). Non-Bayesian uncertainty estimation techniques also exist, such as deep ensembles (Lakshminarayanan et al., 2017) and temperature scaling (Guo et al., 2017; Neumann et al., 2018).

== Preliminaries ==
=== Definitions ===
Let's formalize our classification problem and define some notations for the rest of this summary:

::Dataset:
$$ \mathcal D = \{(x_i,y_i)\} \in (\mathcal X \times \mathcal Y)^N $$
::General classification model
$$ f^W: \mathcal X \to \mathbb R^K $$
::Softmax function:
$$ \phi: \mathbb R^K \to [0,1]^K \;\;|\;\; \phi_k(f^W(x)) = \frac{\exp(f_k^W(x))}{\sum_{j=1}^K \exp(f_j^W(x))} $$
::Softmax activated NN:
$$ \phi \;\circ\; f^W: \mathcal X \to \Delta^{K-1} $$
::NN as a true classifier:
$$ \arg\max_i \;\circ\; \phi_i \;\circ\; f^W \;:\; \mathcal X \to \mathcal Y $$

We'll also define the '''count function''' - a <math>K</math>-vector valued function that outputs the occurrences of each class coincident with <math>x</math> (with each label <math>y'</math> encoded as a one-hot <math>K</math>-vector):
$$ c^{\mathcal D}(x) = \sum_{(x',y') \in \mathcal D} y' \, \mathbb I(x' = x) $$
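
As a rough illustration of the count function, the sketch below tallies class occurrences per distinct input. The name <code>count_function</code>, the toy dataset, and the use of integer class indices in place of one-hot vectors are assumptions made for the example; inputs are assumed to be hashable so they can serve as dictionary keys.

```python
from collections import defaultdict

def count_function(dataset, num_classes):
    """Map each distinct input x to a K-vector of class counts c^D(x)."""
    counts = defaultdict(lambda: [0] * num_classes)
    for x, y in dataset:
        counts[x][y] += 1  # adding the one-hot vector of y at key x
    return dict(counts)

# Toy dataset: input "a" is observed three times, input "b" once.
toy = [("a", 0), ("a", 0), ("a", 1), ("b", 2)]
c = count_function(toy, num_classes=3)
print(c)
```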

=== Classification With a Neural Network ===
A typical loss function used in classification is cross-entropy, which is defined by

$$ l_{\rm CE}(\tilde{y},\phi(f^{W}(x)))=-\sum_k \tilde{y}_k \log \phi_k(f^{W}(x)) $$

where <math>\tilde{y}_k</math> and <math>\phi_k</math> refer to the actual and predicted categorical probabilities for class <math>k</math>. It's well known that optimizing <math>f^W</math> for <math>l_{CE}</math> is equivalent to optimizing for <math>l_{KL}</math>, the <math>KL</math> divergence between the true distribution and the distribution modeled by NN, that is:
$$ l_{KL}(W) = KL(\text{true distribution} \;||\; \text{distribution encoded by }NN(W)) $$
Let's introduce notations for the underlying (true) distributions of our problem. Let <math>(x_0,y_0) \sim (\mathcal X \times \mathcal Y)</math>:
$$ \text{Full Distribution} = F(x,y) = P(x_0 = x,y_0 = y) $$
$$ \text{Marginal Distribution} = F(x) = P(x_0 = x) $$
$$ \text{Point Class Distribution} = P(y_0 = y \;|\; x_0 = x) = F_x(y) $$
Then we have the following factorization:
$$ F(x,y) = P(x,y) = P(y|x)P(x) = F_x(y)F(x) $$
Substitute this into the definition of KL divergence:
$$ l_{KL}(W) = \sum_{(x,y) \in \mathcal X \times \mathcal Y} F(x,y) \log\left(\frac{F(x,y)}{\phi_y(f^W(x))}\right) $$
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F(y|x) \log\left( \frac{F(y|x)}{\phi_y(f^W(x))} \right) $$
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F_x(y) \log\left( \frac{F_x(y)}{\phi_y(f^W(x))} \right) $$
$$ = \sum_{x \in \mathcal X} F(x) KL(F_x \;||\; \phi\left( f^W(x) \right)) $$
As usual, we don't have an analytic form for <math>l_{KL}</math> (if we did, this would imply we know <math>F_x</math>, meaning we knew the distribution in the first place). Instead, we estimate it from <math>\mathcal D</math>:
$$ F(x) \approx \hat F(x) = \frac{||c^{\mathcal D}(x)||_1}{N} $$
$$ F_x(y) \approx \hat F_x(y) = \frac{c^{\mathcal D}(x)}{|| c^{\mathcal D}(x) ||_1}$$
$$ \to l_{KL}(W) = \sum_{x \in \mathcal D} \frac{||c^{\mathcal D}(x)||_1}{N} KL \left( \frac{c^{\mathcal D}(x)}{||c^{\mathcal D}(x)||_1} \;||\; \phi(f^W(x)) \right) $$
The approximations <math>\hat F, \hat F_x</math> are often not very good though: consider a typical classification task such as MNIST, where we would never expect two handwritten digits to produce the exact same image. Hence <math>c^{\mathcal D}(x)</math> is (almost) always going to have a single index 1 and the rest 0. This has implications for our approximations:
$$ \hat F(x) \text{ is uniform for all } x \in \mathcal D $$
$$ \hat F_x(y) \text{ is degenerate for all } x \in \mathcal D $$
This clearly has implications for overfitting: to minimize the KL term in <math>l_{KL}(W)</math> we want <math>\phi(f^W(x))</math> to be very close to <math>\hat F_x(y)</math> at each point - this means that the loss function is in fact encouraging the neural network to output near degenerate distributions!
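
A small numerical sketch makes this pressure towards degenerate outputs visible. With a one-hot target, the KL term reduces to the negative log-probability of the observed class, so it only approaches zero as the predicted distribution itself becomes one-hot. The helper name <code>kl_divergence</code> and the candidate predictions are assumptions chosen for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, treating 0 * log 0 as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Degenerate empirical target: the input was seen once, with class 1.
one_hot = [0.0, 1.0, 0.0]

# Increasingly confident hypothetical predictions phi(f^W(x)).
candidates = [[0.2, 0.6, 0.2], [0.05, 0.9, 0.05], [0.005, 0.99, 0.005]]
losses = [kl_divergence(one_hot, q) for q in candidates]
for q, loss in zip(candidates, losses):
    print(q, round(loss, 4))
```

The loss keeps shrinking as the prediction sharpens, so the optimizer is never rewarded for expressing doubt on a training point.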

'''Label Smoothing'''

One form of regularization to help this problem is called label smoothing. Instead of using the degenerate <math>\hat F_x(y)</math> as a target function, let's "smooth" it (by adding a scaled uniform distribution to it) so it's no longer degenerate:
$$ F'_x(y) = (1-\lambda)\hat F_x(y) + \frac \lambda K \vec 1 $$
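
The smoothing formula above translates directly into code. The following is a minimal sketch; the function name <code>smooth_labels</code> and the value <math>\lambda = 0.1</math> are assumptions for the example, not values taken from the paper.

```python
def smooth_labels(target, lam):
    """Return (1 - lam) * target + (lam / K) * 1 for a K-class target vector."""
    k = len(target)
    return [(1.0 - lam) * t + lam / k for t in target]

# One-hot target for class 1 out of K = 4 classes, smoothed with lam = 0.1.
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], lam=0.1)
print(smoothed)
```

The smoothed target is still a valid distribution, but every class now receives nonzero mass.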

'''BNNs'''

BNNs balance the complexity of the model and the distance to the target distribution without choosing a single best configuration (one-hot encoding). Specifically, for BNNs with the Gaussian weight prior <math>p_W(W) = N (0,T^{-1} I)</math>, the score of a configuration <math>W</math> is measured by the posterior density $$p_W(W|D) \propto p(D|W)p_W(W), \quad \log(p_W(W)) \propto -T||W||^2_2$$
Here <math>||W||^2_2</math> could be a poor proxy for penalizing model complexity due to its linear nature.

== Method ==
The main technical proposal of the paper is a Bayesian framework to estimate the (former) target distribution <math>F_x(y)</math>. That is, we construct a posterior distribution of <math> F_x(y) </math> and use that as our new target distribution. We call it the ''belief matching'' (BM) framework.

=== Constructing Target Distribution ===
Recall that <math>F_x(y)</math> is a <math>K</math>-categorical probability distribution - its PMF can be fully characterized by <math>K</math> numbers that sum to 1. Hence we can encode any such <math>F_x</math> as a point in <math>\Delta^{K-1}</math>. We'll do exactly that - let's call this vector <math>z</math>:
$$ z \in \Delta^{K-1} $$
$$ \text{prior} = p_{z|x}(z) $$
$$ \text{conditional} = p_{y|z,x}(y) $$
$$ \text{posterior} = p_{z|x,y}(z) $$
Then if we perform inference:
$$ p_{z|x,y}(z) \propto p_{z|x}(z)p_{y|z,x}(y) $$
The distribution chosen to model prior was <math>dir_K(\beta)</math>:
$$ p_{z|x}(z) = \frac{\Gamma(||\beta||_1)}{\prod_{k=1}^K \Gamma(\beta_k)} \prod_{k=1}^K z_k^{\beta_k - 1} $$
Note that by definition of <math>z</math>: <math> p_{y|x,z} = z_y </math>. Since the Dirichlet is a conjugate prior to categorical distributions we have a convenient form for the mean of the posterior:
$$ \bar{p_{z|x,y}}(z) = \frac{\beta + c^{\mathcal D}(x)}{||\beta + c^{\mathcal D}(x)||_1} \propto \beta + c^{\mathcal D}(x) $$
This is in fact a generalization of (uniform) label smoothing (label smoothing is a special case where <math>\beta = \frac 1 K \vec{1} </math>).
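
The posterior mean is simple enough to compute by hand, and the sketch below does so for a toy case. It also checks numerically one way to read the link to label smoothing: with a symmetric prior <math>\beta = \frac{a}{K}\vec{1}</math> and a single one-hot observation, the posterior mean equals the smoothed label with <math>\lambda = \frac{a}{1+a}</math>. The function names and the specific numbers are assumptions for illustration.

```python
def posterior_mean(beta, counts):
    """Mean of Dir(beta + counts): (beta + c) / ||beta + c||_1."""
    conc = [b + c for b, c in zip(beta, counts)]
    total = sum(conc)
    return [v / total for v in conc]

def smooth_labels(target, lam):
    """Uniform label smoothing, repeated here so the example is self-contained."""
    k = len(target)
    return [(1.0 - lam) * t + lam / k for t in target]

# Uniform prior beta = 1 with one observation of class 1.
mean_one_obs = posterior_mean([1.0, 1.0, 1.0], [0, 1, 0])
# Same prior with ten observations of class 1: the data now dominates.
mean_ten_obs = posterior_mean([1.0, 1.0, 1.0], [0, 10, 0])
print(mean_one_obs, mean_ten_obs)

# Symmetric prior beta = (a / K) * 1 with a single one-hot observation.
K, a = 4, 1.0
one_hot = [0, 1, 0, 0]
from_prior = posterior_mean([a / K] * K, one_hot)
from_smoothing = smooth_labels(one_hot, lam=a / (1.0 + a))
print(from_prior, from_smoothing)
```

Unlike a fixed smoothing coefficient, the amount of smoothing here shrinks automatically as more observations of the same input accumulate.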

=== Representing Approximate Distribution ===
Our new target distribution is <math>p_{z|x,y}(z)</math> (as opposed to <math>F_x(y)</math>). That is, we want to interpret our neural network (with weights <math>W</math>) as encoding a distribution with support in <math> \Delta^{K-1} </math> - the NN can then be trained so this encoded distribution closely approximates <math>p_{z|x,y}</math>. Let's denote the density of this encoded distribution <math>q_{z|x}^W</math>. This is how the BM framework defines it:
$$ \alpha^W(x) := \exp(f^W(x)) $$
$$ q_{z|x}^W(z) = \frac{\Gamma(||\alpha^W(x)||_1)}{\prod_{k=1}^K \Gamma(\alpha_k^W(x))} \prod_{k=1}^K z_{k}^{\alpha_k^W(x) - 1} $$
$$ \to Z^W_x \sim dir(\alpha^W(x)) $$
Apply <math>\log</math> then <math>\exp</math> to <math>q_{z|x}^W</math>:
$$ q^W_{z|x}(z) \propto \exp \left( \sum_k (\alpha_k^W(x) \log(z_k)) - \sum_k \log(z_k) \right) $$
$$ \propto \exp \left( ||\alpha^W(x)||_1 \left[ -l_{CE}(\phi(f^W(x)),z) + \frac{K}{||\alpha^W(x)||_1}KL(\mathcal U_K \;||\; z) \right] \right) $$
It can actually be shown that the mean of <math>Z_x^W</math> is identical to <math>\phi(f^W(x))</math> - in other words, if we output the mean of the encoded distribution of our neural network under the BM framework, it is theoretically identical to a traditional neural network.
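
This identity can be checked numerically. The sketch below compares the exact Dirichlet mean <math>\alpha_k / ||\alpha||_1</math> with the softmax of the same logits, and also estimates the mean by Monte Carlo, drawing Dirichlet samples by normalizing independent Gamma variates (a standard construction). The logits, seed, and sample size are arbitrary assumptions.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing Gamma(alpha_k, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

logits = [1.0, 0.0, -1.0]              # hypothetical network outputs f^W(x)
alpha = [math.exp(v) for v in logits]  # alpha^W(x) = exp(f^W(x))

exact_mean = [a / sum(alpha) for a in alpha]
soft = softmax(logits)

rng = random.Random(0)
n = 20000
mc_mean = [0.0] * len(alpha)
for _ in range(n):
    z = sample_dirichlet(alpha, rng)
    mc_mean = [m + zi / n for m, zi in zip(mc_mean, z)]

print(exact_mean)
print(soft)
print(mc_mean)
```

The exact mean and the softmax output agree to floating-point precision, while the Monte Carlo estimate agrees up to sampling noise.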

In the limit of <math> q^W_{z|x}(z) \rightarrow p_{z|x}(z)</math>, the mean of the target posterior becomes a virtual label that each individual <math>z</math> ought to match. Hence, the penalty for an ambiguous configuration is determined by the number of observations. Therefore, the distribution matching in BM can be thought of as '''learning to score a categorical probability''' based on its closeness to the posterior mean, where the degree to which this closeness is exploited is automatically controlled by the data.

=== Distribution Matching ===

We now need a way to fit our approximate distribution from our neural network <math>q_{\mathbf{z | x}}^{\mathbf{W}}</math> to our target distribution <math>p_{\mathbf{z|x},y}</math>. The authors achieve this by maximizing the evidence lower bound (ELBO):

$$l_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) $$

Each term can be computed analytically:

$$\mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf W }} \left[\log z_y \right] = \psi(\alpha_y^{\mathbf W} ( \mathbf x )) - \psi(\alpha_0^{\mathbf W} ( \mathbf x )) $$

where <math>\psi(\cdot)</math> represents the digamma function (logarithmic derivative of the gamma function) and <math>\alpha_0^{\mathbf W}(\mathbf x) = \sum_k \alpha_k^{\mathbf W}(\mathbf x)</math> (similarly, <math>\beta_0 = \sum_k \beta_k</math>). Intuitively, we maximize the probability of the correct label. For the KL term:

$$KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) = \log \frac{\Gamma(\alpha_0^{\mathbf W}(\mathbf x)) \prod_k \Gamma(\beta_k)}{\prod_k \Gamma(\alpha_k^{\mathbf W}(\mathbf x)) \Gamma (\beta_0)} + \sum_k \left(\alpha_k^{\mathbf W}(\mathbf x)-\beta_k\right)\left(\psi(\alpha_k^{\mathbf W}(\mathbf x)) - \psi(\alpha_0^{\mathbf W}(\mathbf x))\right) $$

In the first term, for intuition, we can ignore <math>\alpha_0</math> and <math>\beta_0</math> since those just calibrate the distributions. Otherwise, we want the ratio of the products to be as close to 1 as possible to minimize the KL. In the second term, we want to minimize the difference between each individual <math>\alpha_k</math> and <math>\beta_k</math>, scaled by the normalized output of the neural network.

This loss function can be used as a drop-in replacement for the standard softmax cross-entropy, as it has an analytic form and the same time complexity as typical softmax-cross entropy with respect to the number of classes (<math>O(K)</math>).
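
To make the analytic form concrete, here is a minimal standard-library sketch of the ELBO for a single example. It is not the authors' implementation: the function names are hypothetical, and because the Python standard library has no digamma function, one is approximated here with the usual recurrence plus asymptotic series (a library routine would normally be used instead). The example logits and the prior <math>\beta = \mathbf 1</math> are assumptions.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def dirichlet_kl(alpha, beta):
    """Closed-form KL(Dir(alpha) || Dir(beta))."""
    a0, b0 = sum(alpha), sum(beta)
    log_norm = (math.lgamma(a0) - sum(math.lgamma(a) for a in alpha)
                - math.lgamma(b0) + sum(math.lgamma(b) for b in beta))
    return log_norm + sum((a - b) * (digamma(a) - digamma(a0))
                          for a, b in zip(alpha, beta))

def belief_matching_elbo(logits, label, beta):
    """ELBO for one example: E_q[log z_y] - KL(q || prior), with alpha = exp(logits)."""
    alpha = [math.exp(v) for v in logits]
    expected_log_lik = digamma(alpha[label]) - digamma(sum(alpha))
    return expected_log_lik - dirichlet_kl(alpha, beta)

# Hypothetical 3-class example with a uniform Dirichlet prior beta = 1.
elbo = belief_matching_elbo([2.0, 0.5, -1.0], label=0, beta=[1.0, 1.0, 1.0])
print(elbo)  # training would maximize this quantity (minimize its negative)
```

Each term is a sum over the <math>K</math> classes, which is where the <math>O(K)</math> cost comes from.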

=== On Prior Distributions ===

We must choose our concentration parameter, <math>\beta</math>, for our Dirichlet prior. We see our prior essentially disappears as <math>\beta_0 \to 0</math> and becomes stronger as <math>\beta_0 \to \infty</math>. Thus, we want a small <math>\beta_0</math> so the posterior isn't dominated by the prior. However, the authors claim that a small <math>\beta_0</math> makes <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small, which causes <math>\psi (\alpha_0^{\mathbf W}(\mathbf x))</math> to be large in magnitude, which is problematic for gradient-based optimization. In practice, many neural network techniques aim to make <math>\mathbb E [f^{\mathbf W} (\mathbf x)] \approx \mathbf 0</math> and thus <math>\mathbb E [\alpha^{\mathbf W} (\mathbf x)] \approx \mathbf 1</math>, which means making <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small can be counterproductive.

So, the authors set <math>\beta = \mathbf 1</math> and introduce a new hyperparameter <math>\lambda</math> which is multiplied with the KL term in the ELBO:

$$l^\lambda_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - \lambda KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; \mathcal P^D (\mathbf 1)) $$

This stabilizes the optimization, as we can tell from the gradients:

$$\frac{\partial l_{E B}\left(\mathbf{y}, \alpha^{\mathbf W}(\mathbf{x})\right)}{\partial \alpha_{k}^{\mathbf W}(\mathbf {x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\alpha_{k}^{\mathbf W}(\mathbf{x})-\beta_{k}\right)\right) \psi^{\prime}\left(\alpha_{k}^{\mathbf{W}}(\boldsymbol{x})\right)
-\left(1-\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})-\beta_{0}\right)\right) \psi^{\prime}\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})\right)$$

$$\frac{\partial l_{E B}^{\lambda}\left(\mathbf{y}, \alpha^{\mathbf{W}}(\mathbf{x})\right)}{\partial \alpha_{k}^{W}(\mathbf{x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})-\lambda\right)\right) \frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}
-\left(1-\left(\tilde{\alpha}_{0}^{W}(\mathbf{x})-\lambda K\right)\right)$$

As we can see, the first expression is affected by the magnitude of <math>\alpha^{\boldsymbol{W}}(\boldsymbol{x})</math>, whereas the second expression is not due to the <math>\frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}</math> ratio.

== Experiments ==

Throughout the experiments in this paper, the authors employ various models based on residual connections (He et al., 2016), which are the models used for benchmarking in practice. We will first demonstrate improvements provided by BM, then we will show versatility in other applications. For fairness of comparisons, all configurations in the reference implementation are kept fixed. The only additions in the experiments are initial learning rate warm-up and gradient clipping, which are extremely helpful for stable training of BM.

=== Generalization performance ===
The paper compares the generalization performance of BM with softmax and MC dropout on CIFAR-10 and CIFAR-100 benchmarks.

[[File:Being_Bayesian_about_Categorical_Probability_T1.png]]

The next comparison was performed between BM and softmax on the ImageNet benchmark.

[[File:Being_Bayesian_about_Categorical_Probability_T2.png]]

For both datasets and in all configurations, BM achieves the best generalization and outperforms softmax and MC dropout.

===== Regularization effect of prior =====

In theory, BM has 2 regularization effects:
* The prior distribution, which smooths the target posterior
* Averaging all of the possible categorical probabilities to compute the distribution matching loss
The authors perform an ablation study to examine the 2 effects separately - removing the KL term in the ELBO removes the effect of the prior distribution.
For ResNet-50 on CIFAR-100 and CIFAR-10 the resulting test error rates were 24.69% and 5.68% respectively.

This demonstrates that both regularization effects are significant since just having one of them improves the generalization performance compared to the softmax baseline, and having both improves the performance even more.

===== Impact of <math>\beta</math> =====

The effect of β on generalization performance is studied by training ResNet-18 on CIFAR-10 by tuning the value of β on its own, as well as jointly with λ. It was found that robust generalization performance is obtained for β ∈ [<math>e^{−1}, e^4</math>] when tuning β on its own; and β ∈ [<math>e^{−4}, e^{8}</math>] when tuning β jointly with λ. The figure below shows a plot of the error rate with varying β.

[[File:Being_Bayesian_about_Categorical_Probability_F3.png]]

=== Uncertainty Representation ===

One of the big advantages of BM is the ability to represent uncertainty about the prediction. The authors evaluate the uncertainty representation on in-distribution (ID) and out-of-distribution (OOD) samples.

===== ID uncertainty =====

For ID (in-distribution) samples, calibration performance is measured, which is a measure of how well the model’s confidence matches its actual accuracy. This measure can be visualized using reliability plots and quantified using a metric called expected calibration error (ECE). ECE is calculated by grouping predictions into M groups based on their confidence score and then finding the absolute difference between the average accuracy and average confidence for each group. We can define the ECE of <math>f^W </math> on <math>D </math> with <math>M</math> groups as

<center>
<math>ECE_M(f^W, D) = \sum^M_{i=1} \frac{|G_i|}{|D|}|acc(G_i) - conf(G_i)|</math>
</center>
where <math>G_i</math> is the set of samples in the i-th group, defined as <math>G_i = \{j:i/M < \max_k\phi_k(f^W(x^{(j)})) \leq (1+i)/M\}</math>, <math>acc(G_i)</math> is the average accuracy in the i-th group and <math>conf(G_i)</math> is the average confidence in the i-th group.
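
The ECE definition can be sketched as follows. The function name and the toy predictions are assumptions; the bins are indexed from 0 to <math>M-1</math> here so that the intervals <math>(i/M, (i+1)/M]</math> cover <math>(0, 1]</math>, and empty bins are skipped since they contribute nothing to the sum.

```python
def expected_calibration_error(confidences, correct, num_bins):
    """ECE: weighted average of |accuracy - confidence| over confidence bins."""
    n = len(confidences)
    ece = 0.0
    for i in range(num_bins):
        lo, hi = i / num_bins, (i + 1) / num_bins
        idx = [j for j, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue  # an empty bin has weight zero
        acc = sum(correct[j] for j in idx) / len(idx)
        conf = sum(confidences[j] for j in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece

# Toy predictions: max-softmax confidence and whether the prediction was right.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0]
ece = expected_calibration_error(confidences, correct, num_bins=5)
print(ece)
```

In this toy case each occupied bin is off by 0.05 in absolute terms, so the weighted average is also about 0.05.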

The figure below is a reliability plot of ResNet-50 on CIFAR-10 and CIFAR-100 with 15 groups. It shows that BM has a significantly better calibration performance than softmax since the confidence matches the accuracy more closely (this is also reflected in the lower ECE).

[[File:Being_Bayesian_about_Categorical_Probability_F4.png]]

===== OOD uncertainty =====

Here, the authors quantify uncertainty using predictive entropy - the larger the predictive entropy, the larger the uncertainty about a prediction.
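
Predictive entropy for a categorical prediction is the Shannon entropy of the predicted class probabilities. The short sketch below (function name and inputs are assumptions) shows the two extremes: a one-hot prediction has zero entropy, and a uniform prediction over <math>K</math> classes attains the maximum <math>\log K</math>.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical prediction, with 0 * log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

certain = predictive_entropy([1.0, 0.0, 0.0])          # fully confident prediction
uniform = predictive_entropy([1 / 3, 1 / 3, 1 / 3])    # maximally uncertain prediction
print(certain, uniform)
```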

The figure below is a density plot of the predictive entropy of ResNet-50 on CIFAR-10. It shows that BM provides significantly better uncertainty estimation compared to other methods since BM is the only method that has a clear peak of high predictive entropy for OOD samples which should have high uncertainty.

[[File:Being_Bayesian_about_Categorical_Probability_F5.png]]

=== Transfer learning ===

Belief matching applies the Bayesian principle outside the neural network, which means it can easily be applied to already trained models. Thus, belief matching can be employed in transfer learning scenarios. The authors downloaded the ImageNet pre-trained ResNet-50 weights and fine-tuned the weights of the last linear layer for 100 epochs using an Adam optimizer.

This table shows the test error rates from transfer learning on CIFAR-10, Food-101, and Cars datasets. Belief matching consistently performs better than softmax.

[[File:being_bayesian_about_categorical_probability_transfer_learning.png]]

Belief matching was also tested for the predictive uncertainty for out of dataset samples based on CIFAR-10 as the in distribution sample. Looking at the figure below, it is observed that belief matching significantly improves the uncertainty representation of pre-trained models by only fine-tuning the last layer’s weights. Note that belief matching confidently predicts examples in Cars since CIFAR-10 contains the object category automobiles. In comparison, softmax produces confident predictions on all datasets. Thus, belief matching could also be used to enhance the uncertainty representation ability of pre-trained models without sacrificing their generalization performance.

[[File: being_bayesian_about_categorical_probability_transfer_learning_uncertainty.png]]

=== Semi-Supervised Learning ===

Belief matching’s ability to allow neural networks to represent rich information in their predictions can be exploited to aid consistency-based loss functions for semi-supervised learning. Consistency-based loss functions use unlabelled samples to determine where to promote the robustness of predictions based on stochastic perturbations. This can be done by perturbing the inputs (which is the VAT model) or the networks (which is the pi-model). Both methods minimize the divergence between two categorical probabilities under some perturbations, so belief matching can be applied by making the following replacements in the loss functions. The hope is that belief matching can provide better prediction consistencies using its Dirichlet distributions.

[[File: being_bayesian_about_categorical_probability_semi_supervised_equation.png]]

The results of training on ResNet28-2 with consistency based loss functions on CIFAR-10 are shown in this table. Belief matching does have lower classification error rates compared to using a softmax.

[[File:being_bayesian_about_categorical_probability_semi_supervised_table.png]]

== Conclusion and Critiques ==

* Bayesian principles can be used to construct the target distribution by treating the categorical probability as a random variable rather than a training label. This can be applied to neural network models by replacing only the softmax and cross-entropy loss, while improving generalization performance, uncertainty estimation, and calibration.

* In the future, the authors would like to allow for more expressive distributions in the belief matching framework, such as logistic normal distributions to capture strong semantic similarities among class labels. Furthermore, using input dependent priors would allow for interesting properties that would aid imbalanced datasets and multi-domain learning.

* Overall I think this summary is very good. The Method(Algorithm) section is described clearly, and the Results section is detailed, with many diagrams illustrating the main points. I just have one technical suggestion: the difference in performance for SOFTMAX and BM differs by model. For example, for RESNEXT-50 model, the difference in top1 is 0.2, whereas, for the RESNEXT-100 model, the difference in the top one is 0.5, which is significantly higher. It's true that BM method generally outperforms SOFTMAX. But seeing the relation between the choice of model and the magnitude of a performance increase could definitely strengthen the paper even further.

* The summary is good and the topic is interesting. Bayesian modeling is a well-known probabilistic approach, but I did not know that it could be used in a neural network. The comparison between softmax and the Bayesian approach was interesting, and more details would be great.

* It would be better if there were a future work section to discuss the current shortcomings and potential improvements. One issue is that the theoretical part is complex. In addition, optimizing a function is relatively hard if the structure is complex. Is it possible to have a good approximation without having too complex a calculation?

* Both experiments dealt with image data, however, softmax is used within classification neural networks that range from image to textual data. It would be interesting to see the performance of BM on textual data for text classification problems in addition to image classification.

* It would be better to briefly explain Bayesian treatment in the introduction part(i.e., considering the categorical probability as a random variable, construct the target distribution by means of the Bayesian inference), and to analyze the importance of considering the categorical probability as a random variable (for example explain it can be adapted to existing deep learning building blocks without huge modifications).

* Interesting topic that relates closely to our lectures. Since this is a summary of the paper, it would be better to trim the explanation of the neural network a little, for example by removing the substitution lines.

* I really liked the presentation and actually really appreciate the steps of the detailed derivation that were presented in this summary. In the introduction the researchers mentioned that BM is a computationally cheap method, however, I was wondering how much faster it is computationally as opposed to the other models to train. Additionally, the training data that was used to benchmark the classification performance seemed to all be image classifications (CIFAR-10, CIFAR-100, ResNet-50, ResNet-101), thus it would have been nice to see classification be applied in other multi-class contexts as well to see how well this new method performs there.

* It would be more clear if the metric used in evaluating the models is briefly explained!

* The authors' work in applying a Bayesian treatment of the output logits of the neural network to reach a more effective regularization technique is impressive. However, a comparison of computational efficiency between the methods is lacking. When dealing with large data sets, the method is difficult to find convincing if it has no advantage in computational efficiency.

== Citations ==

[1] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Springer, 1990.

[2] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.

[3] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.

[4] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017.

[5] MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

[6] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011.

[7] Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1):4873–4907, 2017.

[8] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference of Machine Learning, 2018.

[9] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.

[10] Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.

[11] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.

[12] Neumann, L., Zisserman, A., and Vedaldi, A. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2018.

[13] Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. Disturblabel: Regularizing cnn on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[14] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.

[15] Jospin, L. V., Buntine, W. V., Boussaid, F. V., Laga, H. V., & Bennamoun, M. V. (2020). Hands-on Bayesian Neural Networks - a Tutorial for Deep Learning Users. Association for Computing Machinery, 3-7. arXiv:2007.06823

Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms

2020-12-07T09:38:04Z

Y87yu: /* Critiques */

== Presented by ==

Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee

== Introduction ==

This paper presents an approach to detecting heart disease from ECG signals by fine-tuning ConvNetQuake, a deep learning neural network. For context, ConvNetQuake is a convolutional neural network, used by Perol, Gharbi, and Denolle [4], for earthquake detection and location from a single waveform. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads to identifying heart disease, to show that the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise for healthcare innovation, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.

== Previous Work and Motivation ==

The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several shortcomings in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo and Attarodi [2] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [3] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [1] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained.

The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The issue with the arbitrary selection of leads is that it offers no insight into why the lead was chosen or into the contribution of each lead to the identification of heart disease. Thus, this paper addresses these two issues by implementing a deep learning model that does not rely on feature selection of ECG data and by quantifying the contributions of each ECG and Frank lead in identifying heart disease.

== Model Architecture ==

The dataset, which was used to train, validate, and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.

The deep neural network was created by modifying the ConvNetQuake model in two ways. First, 1D batch normalization layers were added, which helps to combat overfitting. Second, label smoothing was introduced. Label smoothing relaxes the hard training labels so that the model is discouraged from making overconfident predictions. The authors' experiments demonstrated that both of these modifications helped to increase model accuracy.
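
As an illustration of label smoothing for a binary target, the following is a minimal sketch. The function name <code>smooth_labels</code> and the smoothing strength <code>epsilon=0.1</code> are assumptions made for this example; the summary does not state the value the authors used.

```python
def smooth_labels(labels, epsilon=0.1):
    """Binary label smoothing: move hard 0/1 targets toward 0.5.

    epsilon is an assumed smoothing strength, not a value from the paper.
    """
    return [y * (1.0 - epsilon) + 0.5 * epsilon for y in labels]


hard = [0, 1, 1, 0]
soft = smooth_labels(hard, epsilon=0.1)
print(soft)  # targets are pulled slightly away from 0 and 1
```

With this choice a hard label of 1 becomes roughly 0.95 and a hard label of 0 becomes roughly 0.05, so the loss no longer rewards pushing the sigmoid output all the way to 0 or 1.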

During the training stage, a 10-second long two-channel input was fed into the neural network. To ensure that the two channels were weighted equally, both channels were normalized. In addition, time invariance was incorporated by selecting the 10-second segment randomly from the entire signal.
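
The random segment selection and per-channel normalization can be sketched as follows. This is only an illustration: the helper names are hypothetical, the toy record is made up, and z-score normalization is an assumption, since the summary does not say which normalization the authors applied.

```python
import random
import statistics


def random_window(record, window_len, rng):
    """Pick one window of `window_len` samples at a random offset.

    `record` is a list of channels, each a list of samples of equal length.
    The same offset is used for every channel so the channels stay aligned.
    """
    n = len(record[0])
    start = rng.randrange(n - window_len + 1)
    return [channel[start:start + window_len] for channel in record]


def normalize(channel):
    """Scale one channel to zero mean and unit standard deviation (assumed scheme)."""
    mean = statistics.mean(channel)
    std = statistics.pstdev(channel)
    if std == 0:
        return [0.0 for _ in channel]
    return [(x - mean) / std for x in channel]


rng = random.Random(0)
# Toy two-channel record with 100 samples per channel; real records are far longer.
record = [
    [float(i % 7) for i in range(100)],
    [float(i % 5) * 10 for i in range(100)],
]
window = [normalize(ch) for ch in random_window(record, 10, rng)]
```

Because a new offset is drawn every time a record is used, the network sees differently aligned segments of the same signal, which is what encourages time invariance.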

The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLU activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the sigmoid activation function.

The model is trained using the ADAM optimizer with batches of size 10 and a learning rate of <math>10^{-4}</math>. ADAM (adaptive moment estimation) is a stochastic gradient optimization method that computes an adaptive learning rate for each parameter from estimates of the gradient's first and second moments [5]. For training, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.
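
To make the role of the two moment estimates concrete, here is a minimal sketch of a single ADAM update in plain Python. The learning rate matches the value above; <code>beta1</code>, <code>beta2</code>, and <code>eps</code> are the defaults suggested in [5], and it is an assumption that the authors kept them. The function name <code>adam_step</code> is hypothetical.

```python
import math


def adam_step(params, grads, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one ADAM update in place and return the new step count.

    m and v hold running estimates of the gradient's first and second moments.
    """
    t += 1
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return t


# One step on f(x) = x^2 starting from x = 1, where the gradient is 2x = 2.
params, m, v = [1.0], [0.0], [0.0]
t = adam_step(params, [2.0 * params[0]], m, v, t=0)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by approximately the learning rate regardless of the gradient's scale.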

During the training process, the model was trained from scratch numerous times so that the results would not hinge on the variation introduced by randomly initializing the weights.

The following image gives a visual representation of the model.

[[File:architecture.png | thumb | center | 1000px | Model Architecture (Gupta et al., 2019)]]

==Results==

The paper first quantifies the accuracy of each single channel using 20-fold cross-validation; the channels with the highest individual accuracies are v5, v6, vx, vz, and ii. The researchers then investigated the accuracies of pairs of these top five channels, again using 20-fold cross-validation, and concluded that the best-performing pair to feed into the neural network is lead v6 and lead vz. They then ran 100-fold cross-validation on the v6 and vz pair and compared outliers based on the top 20, top 50, and all 100 models, finding that the standard deviation is non-trivial and that a few models performed very poorly.

Next, they discussed two factors affecting the evaluation of model performance: 1) the random train-val-test split might affect the performance of the model, but this can be improved with access to a larger data set and further discussion; and 2) the random initialization of the weights of the neural network has little effect on the performance of the model, as shown by the high average results obtained with a fixed train-val-test split.

Compared with the models in 12 other papers, the model in this article has the highest accuracy, specificity, and precision. The dataset contained 549 records from 290 unique patients. To ensure that the model did not overfit to specific patient profiles, the authors performed a patient-wise split, where all records associated with a given patient are either in the test data or in the train data (but not both). They tested a 290-fold patient-wise split, and the pair v6 and vz again achieved the highest accuracy, as in the record-wise split. The second-best pair was ii and vz, which also contains the vz channel. Combining the two best pairs into the three channels v6, vz, and ii ultimately gave the best results, with an average accuracy of 97.83% over 10 trials in the patient-wise split. Even though the patient-wise split might result in a lower accuracy estimate, it still maintains a very high average.
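
The idea of a patient-wise split can be sketched as below. This is a simplified single hold-out split rather than the 290-fold procedure described above, and the function name, the record format, and the toy data are all assumptions made for illustration.

```python
import random


def patient_wise_split(records, test_fraction, rng):
    """Split records so that no patient appears in both train and test.

    `records` is a list of (patient_id, record_id) pairs.
    """
    patients = sorted({pid for pid, _ in records})
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [r for r in records if r[0] not in test_patients]
    test = [r for r in records if r[0] in test_patients]
    return train, test


# Toy data: some patients have more than one record.
records = [
    ("p1", "r1"), ("p1", "r2"), ("p2", "r3"), ("p3", "r4"),
    ("p3", "r5"), ("p4", "r6"), ("p5", "r7"),
]
train, test = patient_wise_split(records, 0.2, random.Random(0))
```

The split is made over patient identifiers rather than over records, so a patient with several ECG records can never leak from the training data into the test data.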

==Conclusion & Discussion==

The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-the-art model by a margin of 1 percent. This study finds that, out of the 15 ECG channels (12 conventional ECG leads and 3 Frank leads), channels v6, vz, and ii contain the most meaningful information for detecting myocardial infarction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infarction with a cardiologist-level success rate. To further improve the performance of the models, access to a larger labeled data set is needed. The PTB database is small, and it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infarction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.

A detailed analysis of the relative importance of each of the 15 ECG channels indicates that deep learning can identify myocardial infarction by processing only ten seconds of raw ECG data from the v6, vz, and ii leads, reaching a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infarction. Feature engineering of ECG data is not required to identify myocardial infarction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.

The Fourier transform (computed, for example, with the FFT) can be helpful when dealing with ECG signals. It transforms a signal from the time domain to the frequency domain, where features that are hidden in the time domain may be discovered.
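
As a small illustration of moving from the time domain to the frequency domain, the sketch below evaluates the discrete Fourier transform directly from its definition on a synthetic sine wave. It is not part of the paper's method; in practice an FFT routine would be used instead of this quadratic-time loop, and the signal here is made up.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Return |X[k]| for k = 0..N-1 using the direct O(N^2) DFT definition."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


# A pure sine with 3 cycles over 32 samples.
n = 32
signal = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
mags = dft_magnitudes(signal)
# Only the first half of the spectrum is searched; the second half mirrors it.
peak_bin = max(range(n // 2), key=lambda k: mags[k])
```

The periodicity that is spread over every sample in the time domain shows up as a single dominant frequency bin, which is the kind of feature that can be easier to spot after the transform.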

A limitation specified by the authors is the lack of labeled data. The use of a small dataset such as PTB makes it difficult to determine the robustness of the model due to the small size of the test set. Given a larger dataset, the model could be tested to see if it generalizes to identify heart conditions other than myocardial infarction.

==Critiques==
- The lack of large, labelled data sets is a common problem in applied deep learning studies. Since the PTB database is as small as described, the robustness of the model may be hard to gauge. There are very likely various other physical factors that play a role in the study and that the deep neural network may not be able to adjust for, since health data can be somewhat subjective and/or inaccurate, especially if machines are used for measurement. This might mean that error was propagated forward in the study.

- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.

- I feel that the results of deep learning models in medical settings, where the consequences of misclassification can be severe, should be evaluated by assigning weights to the types of classification error. If a misclassification can lead to severe consequences, then the network should be trained in such a way that it errs towards safety. For example, in the case of heart disease, the consequences are very severe if the system says that there is no heart disease when in fact there is. So, the evaluation metric must be selected carefully.

- This is a useful and meaningful application topic in machine learning. Using deep learning to detect heart disease can be very helpful when it is difficult to detect disease by looking at an ECG with human eyes. This model is also useful for statistics, such as calculating the percentage of people who get heart disease. However, I think doctors should not trust the result from the model 100%, since it is almost impossible to get 100% accuracy from a model. So, I think double-checking by human eyes is necessary if the result looks unusual. What is more, I think it would be interesting to discuss more medical applications of this method, such as analyzing brainwave diagrams to predict a person's mood and to diagnose mental diseases.

- Compared to the datasets for other topics such as object recognition, the PTB database is pretty small, with only 549 ECG records. These are also highly unbalanced (Table 1), with 4 records for myocarditis and 148 for myocardial infarction. Medical datasets can only be labeled by specialists, which is why these datasets are relatively small. It would be great if there were a larger, more comprehensive dataset.

- Only results using 20-fold cross-validation were presented. It should be shown that the results can be reproduced using a more common number of folds, like 5 or 10.

- There are potential issues with the inclusion of Frank leads. From a practitioner's standpoint, ECGs taken with Frank leads are less common, which could prevent the use of this technique. Additionally, Frank leads are expressible as linear combinations of the 12 traditional leads. The authors are not adding any fundamentally new information by including them, and their inclusion could be viewed as a form of feature selection (going against the authors' original intentions).

- It would be better if we could see how the model in this paper outperformed the methods that used feature selection. The details of the results are not sufficient.

- A new extended dataset for PTB, dubbed [https://www.nature.com/articles/s41597-020-0495-6 PTB-XL], has 21837 records. Using this dataset could yield a more accurate result, since the small size of the original PTB dataset posed limitations on the deep learning model.

- The paper mentions that it has better results, but by how much? What accuracy did the compared methods have, and which methods were compared? (The authors mention feature engineering methods, but this is vague.) Also, how much were the labels smoothed (for example, 1 -> 0.99 or 1 -> 0.95)? How much of a difference did the label smoothing make?

- It is nice to see that the authors also considered training and testing the model on data via a patient-wise split, which gives more insight into cases where a patient has multiple diagnostic records. As other critiques suggested, a patient-wise split might suffer from the lack of training data, given that there are only 290 unique patients in the PTB database. Also, acquiring prior knowledge from professionals about correlations, such as causal relationships, between different diagnoses might be helpful for improving the model.

- As mentioned above, the dataset is comparably small in the context of machine learning. On the other hand, each record has a length of roughly 100 seconds, which is quite long for a single input. Therefore, it might be helpful to apply data augmentation during data preprocessing to obtain a more reasonable dataset than the current one, which has a high chance of being biased or leading to overfitting.

- There are several points from the Model Architecture section that can be improved. It mentions that both 1D batch normalization layers and label smoothing are used to improve the accuracy of the models, based on empirical experiment results. Yet, there is no breakdown of how each of these two methods improves the accuracy. So it is left unclear whether each method is significant on its own, or whether the model requires both methods simultaneously in order to achieve improved accuracy. Some more data could be provided about this. It is mentioned that "models are trained from scratch numerous times." How many times is numerous times? Can we get the exact number? The training time of the models should also be provided, because if these models take a long time to train, then training them from scratch every time may cause issues with respect to runtime.

- The authors should have indicated how much the accuracy was improved by each method. It is a little unclear how "better results" is defined. Also, the paper would be clearer if it included details about the model architecture, such as how training was performed and how long it took.

- The summary is lacking several components, such as an explanation of the model, data preprocessing, and result visualization. It is hard to understand how the results improved since there is no comparison. Information about the dataset is also unclear: it is not explained well what the records are and how the dataset was populated.

- The authors didn't specify how many epochs the model ran for. A common practice when dealing with small datasets is to run more epochs at the risk of overfitting. However, the use of batch normalization (and perhaps the introduction of Dropout layers) helps prevent the model from overfitting the data or reinforcing the bias of the dataset, so more epochs may have improved performance in this case.

- It is difficult to justify the effectiveness of deep learning for detecting myocardial infarction in EKG due to the lack of information available on the deep learning structure. Meanwhile, false negatives and false positives must be as close to 0 as possible, therefore the authors should test their algorithm on a variety of datasets before determining if deep learning is effective.

- The authors do not motivate the use of ConvNetQuake as their baseline model for deep transfer learning. There are likely several other model candidates that perform similar signal processing related tasks such as CNN models for gravitational wave detection.

- Further there is very limited mention of the ECG data used, and what features are of interest. For someone who has limited domain-knowledge about Myocardial Infarctions and ECGs, it is hard to interpret and relate the information both in the original paper and the summary. There is large use of medical terminology that the average student is not likely to know. The absence of concrete data and results leads to a lot of confusion for someone trying to understand the relevance of the model to ECGs.

- Although the application prospects mentioned by the authors are exciting, the model still needs many improvements. First, misdiagnosis in medical examination is a very serious form of medical malpractice, so a confusion matrix should be included when assessing the robustness of the model. Both false positive and false negative results should be considered. Second, the data set is still small. It is recommended that the authors carry out long-term tracking and supplementation of the data on different heart diseases in order to form more robust conclusions in the future.

== References ==

[1] Na Liu et al. "A Simple and Effective Method for Detecting Myocardial Infarction Based on Deep Convolutional Neural Network". In: Journal of Medical Imaging and Health Informatics (Sept. 2018). doi: 10.1166/jmihi.2018.2463.

[2] Naser Safdarian, N.J. Dabanloo, and Gholamreza Attarodi. "A New Pattern Recognition Method for Detection and Localization of Myocardial Infarction Using T-Wave Integral and Total Integral as Extracted Features from One Cycle of ECG Signal". In: J. Biomedical Science and Engineering (Aug. 2014). doi: http://dx.doi.org/10.4236/jbise.2014.710081.

[3] L.D. Sharma and R.K. Sunkaria. "Inferior myocardial infarction detection using stationary wavelet transform and machine learning approach." In: Signal, Image and Video Processing (July 2017). doi: https://doi.org/10.1007/s11760-017-1146-z.

[4] Thibaut Perol, Michaël Gharbi, and Marine Denolle. "Convolutional neural network for earthquake detection and location". In: Science Advances (Feb. 2018). doi: 10.1126/sciadv.1700578.

[5] D. Kingma and J. Ba. "Adam: A Method for Stochastic Optimization". In: 3rd International Conference on Learning Representations, San Diego (2015). Available at: https://arxiv.org/pdf/1412.6980.pdf [Accessed 3 December 2020].

Evaluating Machine Accuracy on ImageNet

2020-12-07T09:26:19Z

Y87yu: /* Critiques */

== Presented by ==
Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du

== Introduction ==
ImageNet is one of the most influential datasets in machine learning, with images and corresponding labels spanning 1000 classes. This paper explores the causes of the performance differences between human experts and machine learning models, more specifically CNNs, on ImageNet.

Firstly, some images could belong to multiple classes. As a result, it is possible to underestimate the performance if we assign each image with only one label, which is what is being done in the top-1 metric. On the other hand, the top-5 metric looks at the top five predictions by the model for an image and checks if the target label is within those five predictions (Krizhevsky, Sutskever, & Hinton). Therefore, we adopt both top-1 and top-5 metrics where the performances of models, unlike human labelers, are linearly correlated in both cases.

Secondly, in contrast to the uniform performance of models across classes, humans tend to achieve better performance on inanimate objects. Human labelers achieve overall accuracies similar to those of the models, which indicates room for improvement on specific classes for machines.

Lastly, the setup of drawing training and test sets from the same distribution may favor models over human labelers. That is, the accuracy of multi-class prediction from models drops when the test set (here, ImageNetV2) is drawn from a different distribution than the training set. This shift in distribution does not cause a problem for human labelers.

== Experiment Setup ==
=== Overview ===
There are four main phases to the experiment, which are (i) initial multi-label annotation, (ii) human labeler training, (iii) human labeler evaluation, and (iv) final annotation review. The five authors of the paper are the participants in the experiments.

A brief overview of the four phases is as follows:
[[File:Experiment Set Up.png |800px| center]]

=== Initial multi-label annotation ===
Three labelers A, B, and C provided multi-label annotations for a subset of size 20,000 from the ImageNet validation set, and for all 20,683 images from the ImageNetV2 test sets. This work gave A, B, and C extensive experience with the ImageNet dataset.

=== Human Labeler Training ===
All five labelers trained on labeling a subset of the remaining ImageNet images. "Training" the human labelers consisted of teaching them the distinctions between very similar classes in the training set. For example, there are 118 classes of "dog" within ImageNet, and typical human participants will not have working knowledge of the name of each breed even if they can recognize and distinguish that breed from others. Local members of the American Kennel Club were even contacted to help with dog breed classification. To address this, labelers were trained on class-specific tasks for groups such as dogs, insects, monkeys, beavers, and others. They were also given immediate feedback on whether they were correct and were then asked where they thought they needed more training to improve. Unlike the two annotators in (Russakovsky et al., 2015), who had insufficient training data, the labelers in this experiment had up to 100 training images per class while labeling. This allowed the labelers to really understand the finer details of each class.

=== Human Labeler Evaluation ===
Class-balanced random samples, each containing 1,000 images from the 20,000 annotated images, are generated from both the ImageNet validation set and ImageNetV2. Five participants labeled these images over 28 days.

=== Final annotation Review ===
All labelers reviewed the additional annotations generated in the human labeler evaluation phase.

== Multi-label annotations==
[[File:Categories Multilabel.png|800px|center]]
<div align="center">Figure 3</div>

===Top-1 accuracy===
Top-1 accuracy is the standard accuracy measure used in classification studies. It measures the proportion of examples for which the predicted label matches the single target label. However, many images contain more than one object; for example, Figure 3a contains a desk, laptop, keyboard, space bar, and more. Figure 3b shows a centered, prominent subject that is nevertheless labeled otherwise (people vs. picket fence). A single target label is therefore inadequate for such a task: the metric is overly stringent and punishes predictions that identify the main object in the image but do not match its label.
===Top-5 accuracy===
Top-5 accuracy considers a classification correct if the object label is among the top 5 predicted labels. Although it partially resolves the problem with Top-1 labeling, it is still not ideal, since it can trivialize class distinctions. For instance, the dataset contains five turtle classes, and under Top-5 evaluation a model does not need to distinguish between them to be counted as correct.
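
The two metrics differ only in how many of the ranked predictions are inspected, which the following minimal sketch makes explicit. The function name and the toy predictions are made up for illustration and are not taken from the paper's data.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of examples whose target is among the first k predictions.

    ranked_predictions[i] lists labels for example i, most confident first.
    """
    hits = sum(t in preds[:k] for preds, t in zip(ranked_predictions, targets))
    return hits / len(targets)


# Toy example with made-up model outputs.
ranked = [
    ["desk", "laptop", "keyboard", "mouse", "monitor", "lamp"],
    ["picket fence", "groom", "gown", "suit", "bow tie", "church"],
    ["terrapin", "mud turtle", "box turtle", "loggerhead", "leatherback turtle", "tortoise"],
]
targets = ["laptop", "picket fence", "box turtle"]
top1 = top_k_accuracy(ranked, targets, 1)
top5 = top_k_accuracy(ranked, targets, 5)
```

In the third example the model simply lists five turtle classes, so it is counted as correct under Top-5 without having distinguished between them, while Top-1 penalizes the first example even though a desk is visible in the image.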

===Multi-label accuracy===
The paper then proposes that every image have a set of target labels; a prediction is considered correct if it matches one of these labels. Given the limitations of the Top-1 and Top-5 metrics discussed above, the paper claims that this metric is necessary for rigorous accuracy evaluation on the dataset.

===Types of Multi-label annotations===
====Multiple objects or organisms====
For images containing more than one object or organism that corresponds to an ImageNet class, the paper proposes adding an additional target label for each entity in the image. In the image in Figure 3b discussed above, the classes groom, bow tie, suit, gown, and hoopskirt are all present in the foreground and are therefore added to the set of labels.
====Synonym or subset relations====
Similar classes are treated as belonging to the same broader class; that is, for two similarly labeled images, a classification is considered correct if the produced label matches either one of the labels. For instance, warthogs, African elephants, and Indian elephants all have prominent tusks, so they are considered subclasses of tusker. Figure 3c shows a modification of the labels to contain tusker as a correct label.
====Unclear Image====
In certain cases, such as Figure 3d, it is particularly difficult to determine whether a label is correct due to ambiguities in the class hierarchy, such as differentiating between a lakeshore and a seashore.

===Collecting multi-label annotations===
Participants reviewed all predictions made by the models on the ImageNet and ImageNet-V2 datasets. They then categorized every unique prediction made by the models into correct and incorrect labels, so that every image can have multiple correct labels, as the method above requires.
===The multi-label accuracy metric===
A prediction is correct if and only if it was marked correct by the expert reviewers during the annotation stage. As discussed in the experiment setup section, a second annotation stage is conducted after the human labelers have completed labeling. In Figure 4, a comparison of Top-1, Top-5, and multi-label accuracies shows that higher Top-1 and Top-5 accuracy corresponds to higher multi-label accuracy, as expected. Multi-label accuracy is consistently higher than Top-1 accuracy yet lower than Top-5 accuracy, and the three metrics are highly correlated. The paper concludes that the multi-label metric measures a semantically more meaningful notion of accuracy than its counterparts.
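
A minimal sketch of the metric follows, assuming the reviewers' decisions are stored as one set of accepted labels per image. The function name, that data layout, and the toy labels are assumptions made for this example.

```python
def multi_label_accuracy(predictions, valid_labels):
    """Fraction of single predictions that fall in the image's set of accepted labels."""
    hits = sum(pred in labels for pred, labels in zip(predictions, valid_labels))
    return hits / len(predictions)


# One prediction per image and the labels reviewers accepted for that image.
predictions = ["groom", "tusker", "seashore"]
valid_labels = [
    {"picket fence", "groom", "bow tie", "suit", "gown", "hoopskirt"},
    {"African elephant", "tusker"},
    {"lakeshore"},
]
accuracy = multi_label_accuracy(predictions, valid_labels)
```

Unlike Top-5, each image still receives exactly one prediction, so a model cannot gain credit by listing several near-identical classes.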

== Human Accuracy Measurement Process ==
=== Bias Control ===
Since three of the participants took part in the initial round of annotation, they did not look at the data for six months afterwards, and two additional annotators were introduced in the final evaluation phase to ensure the fairness of the experiment.

=== Human Labeler Training ===
The three main difficulties encountered during human labeler training are fine-grained distinctions, class unawareness, and insufficient training images. Three training regimens are provided to address these problems, respectively. First, labelers are assigned extra training tasks with immediate feedback on similar classes. Second, labelers are given the ability to search for specific classes during labeling. Finally, the training set contains a reasonable number of images for each class. Fine-grained class distinctions are especially difficult for humans. To help humans perform well on these class distinctions, the training tasks only contained images from certain animal families.

=== Labeling Guide ===
A labeling guide is constructed to distill class analysis learned during training into discriminative traits that could be used as a reference during the final labeling evaluation.

=== Final Evaluation and Review ===
Two samples, each containing 1000 images, are drawn from ImageNet and ImageNetV2, respectively. They are sampled in a class-balanced manner and shuffled together. Over 28 days, all five participants labeled all images, spending a median of 26 seconds per image. After labeling was completed, an additional multi-label annotation session was conducted, in which human predictions for all images were manually reviewed. Compared to the initial round of labeling, 37% of the labels changed due to participants' greater familiarity with the classes.

== Main Results ==
[[File:Evaluating Machine Accuracy on ImageNet Figure 1.png | center]]

<div align="center">Figure 1</div>

===Comparison of Human and Machine Accuracies on ImageNet===
From Figure 1, we can see that the difference in accuracies between the datasets is within 1% for all human participants. As hypothesized, human testers indeed performed better than the automated models on both datasets. It's worth noticing that labelers D and E, who did not participate in the initial annotation period, actually performed better than the best automated model.
===Comparison of Human and Machine Accuracies on ImageNet===
Based on the results shown in Figure 1, the confidence intervals of the best 4 human participants and the best 4 models overlap; however, McNemar's paired test yields a p-value of 0.037, which rejects the hypothesis that the FixResNeXt model and human labeler E have the same accuracy on the ImageNet validation dataset. Figure 1 also shows that, on ImageNet-V2, the confidence intervals of the labeling accuracies for human labelers C, D, and E do not overlap with the confidence interval of the best model. With McNemar's test yielding a p-value of <math>2\times 10^{-4}</math>, the hypothesis that humans and machine models have the same robustness to distribution shifts ought to be rejected.
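
McNemar's test compares two classifiers on the same images by looking only at the images on which they disagree. The sketch below implements the exact (binomial) two-sided form; the summary does not say which variant of the test the authors used, so this choice, the function name, and the example counts are assumptions.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: images labeler A got right and labeler B got wrong.
    c: images labeler A got wrong and labeler B got right.
    Under the null hypothesis each disagreement is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_balanced = mcnemar_exact(5, 5)   # disagreements split evenly
p_one_sided = mcnemar_exact(0, 10)  # every disagreement favors one side
```

Images that both the human and the model label correctly, or both label incorrectly, carry no information about which of the two is more accurate, which is why only the two disagreement counts enter the test.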

== Other Observations ==

[[File: Results_Summary_Table.png| 800px|center]]

=== Difficult Images ===

The experiment also shed some light on images that are difficult to label. 10 images were misclassified by all of the human labelers. Among those 10 images, there was 1 image of a monkey and 9 of dogs. In addition, 27 images, with 19 in object classes and 8 in organism classes, were misclassified by all 72 machine learning models in this experiment. Only 2 images were labeled wrong by all human labelers and models. Both images contained dogs. Researchers also noted that difficult images for models are mostly images of objects and exclusively images of animals for human labelers.

=== Accuracies without dogs ===

As previously discussed in the paper, machine learning models tend to outperform human labelers when classifying the 118 dog classes. To better understand to what extent models outperform human labelers, researchers computed the accuracies again with all the dog classes excluded. Results showed a 0.6% increase in accuracy on the ImageNet images using the best model and a 1.1% increase on the ImageNet V2 images. In comparison, the mean increases in accuracy for human labelers are 1.9% and 1.8% on the ImageNet and ImageNet V2 images respectively. Researchers also conducted a simulation to demonstrate that the increase in human labeling accuracy on non-dog images is significant. This simulation was done by bootstrapping to estimate the changes in accuracy when only using data for the non-dog classes, and simulation results show smaller increases than in the experiment.

In conclusion, it's more difficult for human labelers to classify images with dogs than it is for machine learning models.
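
The summary does not spell out the resampling scheme used in the simulation, so the following is only a generic bootstrap sketch of how a change in accuracy from excluding dog images could be estimated. The function name, the data layout, and the toy data are assumptions, and this should not be read as the authors' exact procedure.

```python
import random


def bootstrap_accuracy_change(correct, is_dog, n_resamples, rng):
    """Bootstrap the change in accuracy when dog images are excluded.

    correct[i] is 1 if image i was labeled correctly, else 0.
    is_dog[i] is True if image i belongs to a dog class.
    Returns one accuracy difference (non-dog minus all images) per resample.
    """
    n = len(correct)
    changes = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        all_acc = sum(correct[i] for i in idx) / n
        non_dog = [i for i in idx if not is_dog[i]]
        if not non_dog:
            continue  # skip degenerate resamples that contain only dog images
        non_dog_acc = sum(correct[i] for i in non_dog) / len(non_dog)
        changes.append(non_dog_acc - all_acc)
    return changes


# Toy data: 80 non-dog images labeled correctly, 20 dog images labeled incorrectly.
correct = [1] * 80 + [0] * 20
is_dog = [False] * 80 + [True] * 20
changes = bootstrap_accuracy_change(correct, is_dog, 200, random.Random(0))
```

The spread of the resampled differences gives a sense of how much of an observed increase could be explained by sampling noise alone.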

=== Accuracies on objects ===
Researchers also computed machine and human labelers' accuracies on a subset of data with only objects, as opposed to organisms, to better illustrate the differences in performance. This test involved 590 object classes. As shown in the table above, there is a 3.3% and 3.4% increase in mean accuracies for human labelers on the ImageNet and ImageNet V2 images. In contrast, there is a 0.5% decrease in accuracy for the best model on both ImageNet and ImageNet V2. This indicates that human labelers are much better at classifying objects than these models are.

=== Accuracies on fast images ===
Unlike the CNN models, human labelers spent different amounts of time on different images, spanning from several seconds to 40 minutes. To further analyze the images that take human labelers less time to classify, researchers took a subset of images with median labeling time spent by human labelers of at most 60 seconds. These images were referred to as "fast images". There are 756 and 714 fast images from ImageNet and ImageNet V2 respectively, out of the total 2000 images used for evaluation. Accuracies of models and humans on the fast images increased significantly, especially for humans.

This result suggests that human labelers know when an image is difficult to label and would spend more time on it. It also shows that the models are more likely to correctly label images that human labelers can label relatively quickly.
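
Selecting the fast images amounts to thresholding the per-image median labeling time, as in the sketch below. The function name, the dictionary layout, and the timings are made up for illustration; only the 60-second threshold comes from the text above.

```python
import statistics


def fast_image_ids(times_by_image, threshold_seconds=60):
    """Return ids of images whose median labeling time is at most the threshold.

    times_by_image maps an image id to the labeling times (in seconds)
    recorded by the individual labelers.
    """
    return [
        image_id
        for image_id, times in times_by_image.items()
        if statistics.median(times) <= threshold_seconds
    ]


# Made-up timings for three images and five labelers.
times = {
    "img_a": [12, 20, 26, 31, 45],
    "img_b": [50, 58, 75, 90, 2400],
    "img_c": [8, 60, 60, 61, 300],
}
fast = fast_image_ids(times)
```

Using the median rather than the mean keeps a single very slow labeler, such as the 40-minute case mentioned above, from pushing an otherwise quick image out of the subset.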

== Related Work ==

=== Human accuracy on ImageNet ===

Russakovsky et al. (2015) studied two trained human labelers' accuracies on 1500 and 258 images in the context of the ImageNet challenge. The top-5 accuracy of the labeler who labeled 1500 images was the well-known human baseline on ImageNet.

As introduced before, the researchers went beyond this work by using multi-label accuracy, using more labelers, and focusing on robustness to small distribution shifts. Although the researchers had some different findings, some results are consistent with those of (Russakovsky et al., 2015). For example, both experiments indicated that it takes human labelers around one minute to label an image. The time distribution also has a long tail, due to the difficult images mentioned before.

=== Human performance in computer vision broadly ===
There are many recent studies about humans in the area of computer vision, such as investigating human robustness to synthetic distribution change (Geirhos et al., 2017) and studying which characteristics humans use to recognize objects (Geirhos et al., 2018). Other examples include constructing adversarial examples that fool both machines and time-limited humans (Elsayed et al., 2018) and illustrating the effects of foreground/background objects on human and machine performance (Zhu et al., 2016).

=== Multi-label annotations ===
Stock & Cissé (2017) also studied ImageNet's multi-label nature, which aligns with the researchers' study in this paper. According to Stock & Cissé (2017), the top-1 accuracy measure could underestimate multi-label accuracy by up to 13.2%. The authors suggest that releasing these labelled data to the public will allow for more robust models in the future.
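
The gap between top-1 and multi-label accuracy can be illustrated with a small sketch. The function names, class names, and label sets below are hypothetical toy data, not values from either paper; the only assumption carried over from the summary is that a prediction counts as correct under the multi-label metric if it falls in the image's set of correct labels.

```python
def top1_accuracy(predictions, single_labels):
    """Fraction of images whose prediction equals the one original label."""
    hits = sum(pred == label for pred, label in zip(predictions, single_labels))
    return hits / len(predictions)


def multi_label_accuracy(predictions, label_sets):
    """Fraction of images whose prediction is in that image's set of correct labels."""
    hits = sum(pred in labels for pred, labels in zip(predictions, label_sets))
    return hits / len(predictions)


# Hypothetical toy data: four images, one predicted class per image.
predictions = ["desk", "laptop", "tabby", "mud turtle"]
single_labels = ["desk", "keyboard", "tabby", "box turtle"]
label_sets = [
    {"desk", "monitor"},
    {"keyboard", "laptop"},
    {"tabby"},
    {"box turtle"},
]

top1 = top1_accuracy(predictions, single_labels)
multi = multi_label_accuracy(predictions, label_sets)
print(top1, multi)
```

In this toy case the second image contains both a keyboard and a laptop, so the prediction "laptop" is wrong under top-1 but correct under the multi-label metric, which is the kind of underestimate described above.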

=== ImageNet inconsistencies and label error ===
Researchers found and recorded some incorrectly labeled images from ImageNet and ImageNet V2 during this study. An earlier study (Van Horn et al., 2015) also showed that at least 4% of the birds in ImageNet are misclassified. That work also noted that the inconsistent taxonomic structure in the bird classes could lead to weak class boundaries. Researchers also noted that the majority of the fine-grained organism classes had similar taxonomic issues.

=== Distribution shift ===
There has been an increasing number of studies in this area. One focus of the studies is distributionally robust optimization (DRO), which finds the model that has the smallest worst-case expected error over a set of probability distributions. Another focus is on finding the model with the lowest error rates on adversarial examples. Work in both areas has been productive, but none was shown to resolve the drop in accuracies between ImageNet and ImageNet V2. A recent [https://papers.nips.cc/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf paper] also discusses quantifying uncertainty under a distribution shift, in other words whether the output of probabilistic deep learning models should or should not be trusted.
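
In its standard textbook form, the DRO objective described above can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the paper: <math>\theta</math> denotes the model parameters, <math>\ell</math> a loss function, and <math>\mathcal{P}</math> the set of probability distributions under consideration.

<center><math>\min_{\theta} \; \sup_{Q \in \mathcal{P}} \; \mathbb{E}_{(x,y) \sim Q}\left[\ell(\theta; x, y)\right]</math></center>

The inner supremum picks the distribution in <math>\mathcal{P}</math> with the largest expected error, and the outer minimization chooses the model that makes this worst case as small as possible.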

== Conclusion and Future Work ==

=== Conclusion ===
Researchers noted that in order to achieve truly reliable machine learning, they need a deeper understanding of the range of parameters where the model still remains robust. Techniques from combinatorics and sensitivity analysis, in particular, might yield fruitful results. This study has provided valuable insights into the desired robustness properties by comparing model performance to human performance. This is especially evident given the results of the experiment, which show humans drastically outperforming machine learning in many cases and raise the question of how much accuracy one is willing to give up in exchange for efficiency. The results have shown that current performance benchmarks do not address robustness to small and natural distribution shifts, which are easily handled by humans.

=== Future work ===
Other than improving the robustness of models, researchers should consider investigating whether less-trained human labelers can achieve a similar level of robustness to distributional shifts. In addition, researchers can study the robustness to temporal changes, which is another form of natural distribution shift (Gu et al., 2019; Shankar et al., 2019). Convolutional neural networks are also a candidate for improving the accuracy of image classification.

== Critiques ==
# The method of using humans to classify ImageNet is fully circular, since the ImageNet labels were themselves originally annotated by human beings. In fact, the classification scheme itself is intrinsically a human construction. It is not logical to test human performance with human performance. This circular construction completely violates scientific principles.
# Table 1 simply showed a difference in ImageNet multi-label accuracy yet does not give an explicit reason as to why such a difference is present. Although the paper suggested the distribution shift has caused the difference, it does not give other factors to concretely explain why the distribution shift was the cause.
# With the recommendation to future machine evaluations, the paper proposed to "Report performances on dogs, other animals, and inanimate objects separately.". Despite its intentions, it is narrowly specific and requires further generalization for it to be convincing.
# No information was given about how the human labelers were chosen, nor was any background information about them provided. Since this is a classification problem involving many classes as specific as individual species, a biology student would likely give far more accurate results than a computer science or math student. To make this study more plausible, more human labelers should be sampled.
# As explaining the importance of multi-label metrics using comparison to Top-5 metric, the turtle example falls within the overall similarity (simony) classification of the multi-label evaluation metric, as such, if the Top-5 evaluation suggests any one of the turtle species were selected, the algorithm is considered to produce a correct prediction which is the intention. The example does not convey the necessity of changing to the proposed metric over the Top-5 metric.
# With the definition in the paper regarding multi-label metrics, it is hard to see why expanding the label set is different from a traditional Top-5 metric or rather necessary, ergo does not yield the claim which the proposed metric is necessary for rigorous accuracy evaluation on ImageNet.
# When discussing the main results, the paper addresses the hypothesis that distribution shift has no effect on human and machine model accuracies; however, the presentation is poor, with no clear focus on what the authors are trying to convey or on how (in detail) they arrived at such claims.
# In the experiment setup of the presentation, there are many key terms without detailed descriptions. For example: "Human labeler training: using a subset of the remaining 30,000 unannotated images in the ImageNet validation set, labelers A, B, C, D, and E underwent extensive training to understand the intricacies of fine-grained class distinctions in the ImageNet class hierarchy." The authors should clarify each key term in the presentation; otherwise it is hard for readers to follow.
# Not sure how the human samplers were determined and simply picking several people will have really high bias because the sample is too small and they have different background which will definitely affect the results a lot. Also, it will be better if there are more comparisons between the model introduced and other models.
# Given the low amount of human participants, it is hard to take the results seriously (there is too much variance). Also it's not exactly clear how the authors determined that the multi-label accuracy metric measures a semantically more meaningful notion of accuracy compared to its counterparts. For example, one of the issues with top-5 accuracy that they mention is: "For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations." But it's not clear how multi-label accuracy would be better in this instance.
# It is unclear how well the human labeler can perform labeling after training. So the final result is not that trust-worthy.
# In this experiment set up, label annotators are the same as participants of the experiments. Even if there's a break between the annotating and evaluating human labeler evaluation, the impact of the break in reducing bias is not clear. One potential human labeling data is google's "I'm not a robot" verification test. One variation of the verification test asks users to select all the photos from 9 images that are related to a certain keyword. This allows for a more accurate measurement of human performance vs ImageNet performance. In addition, it's going to reduce the biases from the small number of experiment participants.
# Following Table 2, the authors appear to claim that the model is better than the human labelers simply because the model's accuracy improved more than the human labelers' accuracy did after the dog photos were removed. However, a quick look at the table shows that most human labelers still performed better than the best model. The authors should instead claim that human labelers are better at labeling dogs than the model, and that they remain better overall after the dog classes are removed.
# The reason why human labeler outperforms CNN could be human had much more training. It would be more convincing if the paper could provide a metric in order to measure human labelers' training data set size.
# Actually, in the multi-label case, it is vague to determine whether the machine learning model or the human labellers were giving the correct label. The structure of the dataset is pretty essential in training a network, in which data with uncertain label (even determined by human) should be avoided.
# The authors mentioned that untrained labelers will likely have lower accuracy; they could give a standard or definition for a well-trained labeler.
# I believe the authors needed to include more information about how they determined the samples such as human samplers, and also more details on how to define unclear images.
# It would be more convincing if the author could provide the criteria of being human samplers and unclear images, and the accuracy of the human labeler.
# The summary only explains some model components but does not thoroughly go through the big picture of the model: data preprocessing, training, and prediction procedures. It would be nice to know the details as well.
# It seems the core problem is more about the dataset itself and not the evaluation procedure. We would not have issues with top 1 and top 5 if Imagenet contained discernable classes with good labels. Of course, this is very expensive, and imagenet is an _excellent_ dataset given these constraints. It does not seem like their proposed solution, multiple labels per image, addresses their concerns properly, as other critiques have already mentioned. Furthermore, having multiple labels per image does not translate to real-life value the same way that the top 5 or top 1 metric does, as in the common case, there is one right answer for a classification problem.
# The paper could provide details on ways to improve the accuracy and robustness of the model. Since the paper mentions CNN, it could provide details of the model and why CNN is a good candidate.
# The accuracy of the model is directly correlated with how the images are labelled. In all multi-label annotations, the authors describe a predicted label as correct if it is within a set of "correct labels" where each image has a different number of correct labels. Perhaps it would yield better results if the model were to first identify the number of objects in the image first and then by using some form of criteria, it labels those identified objects in order of importance (i.e. objects that are closer are labelled first). The authors also never specified what criteria the model uses to "pick out" which object it will label in the image.
# The paper mentions difficult images and fast images. It would be better if the paper had generalized the type of images that constitute difficult images (i.e., the paper mentions 118 dog classes, what are some general characteristics of difficult images?) In addition, it would be interesting to compare the performance between human and machine accuracy on non-fast images.
# The paper meaningfully and correctly points out that the current practice of evaluating ML algorithms using only accuracy on ImageNet as the benchmark is simplistic and problematic. However, the idea of comparing human performance with ML models is also problematic, since it is hard or even impossible to control the many variables that can drastically change human performance: training time, domain knowledge, cognitive function, workload, and various environmental factors. In order to compare different experimental methods, the most important step is to carefully control the confounding variables so that a meaningful conclusion can be reached.
# The original question would be how ImageNet was created? The images can only be labeled by human labelers. There might be some mistakes and missing details. The combination of human evaluation and machine evaluation might be helpful to resolve this problem.
# Regarding the difficult problems, CatsandDogs is a very popular dataset that is nowadays trivial for a non-deep CNN to classify with high accuracy. The authors would do well to test this CNN with other classes, since it's inconceivable that a CNN cannot learn the characteristics of a dog in relation to the other classes in ImageNet. Why would it be able to differentiate the breeds of dogs, but not other organisms?
# The model is limited by the representativeness of the small sample of human participants. To evaluate the performance of the human group, the randomness of the participant sampling should be confirmed first. Alternatively, the authors need to formulate a rule for comparing the accuracy of the human group and the machine group under this framework, to make sure the result is robust.

== Reference ==
[1] Shankar, V., Roelofs, R., Mania, H., Fang, A., Recht, B., & Schmidt, L. (2020). Evaluating Machine Accuracy on ImageNet. ICML 2020.

[2] Krizhevsky, A., Sutskever, I., & Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. 2. Retrieved from http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf

stat441F21

2020-12-01T04:16:07Z

Y87yu: /* Paper presentation */

== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==



= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=

Use the following notations:

P: You have written a summary/critique of the paper.

T: You had a technical contribution on a paper (excluding the paper that you present).

E: You had an editorial contribution on a paper (excluding the paper that you present).

=Paper presentation=
{| class="wikitable"

{| border="1" cellpadding="3"
|-
|width="60pt"|Date
|width="250pt"|Name
|width="15pt"|Paper number
|width="700pt"|Title
|width="15pt"|Link to the paper
|width="30pt"|Link to the summary
|width="30pt"|Link to the video
|-
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]
|-
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||
[https://www.youtube.com/watch?v=TVLpSFYgF0c&feature=youtu.be]
|-
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Influenza_Forecasting_Framework_based_on_Gaussian_Processes Summary]|| [https://www.youtube.com/watch?v=HZG9RAHhpXc&feature=youtu.be]
|-
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin Summary] || [https://learn.uwaterloo.ca/d2l/ext/rp/577051/lti/framedlaunch/6ec1ebea-5547-46a2-9e4f-e3dc9d79fd54]
|-
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Streaming_Bayesian_Inference_for_Crowdsourced_Classification Summary] || [https://www.youtube.com/watch?v=UgVRzi9Ubws]
|-
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| Neural Ordinary Differential Equations || [https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_ODEs Summary]||
|-
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| Adversarial Attacks on Copyright Detection Systems || Paper [https://proceedings.icml.cc/static/paper_files/icml/2020/1894-Paper.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems Summary] || [https://www.youtube.com/watch?v=bQI9S6bCo8o]
|-
|Week of Nov 16 || Casey De Vera, Solaiman Jawad || 7|| IPBoost – Non-Convex Boosting via Integer Programming || [https://arxiv.org/pdf/2002.04679.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=IPBoost Summary] || [https://www.youtube.com/watch?v=4DhJDGC4pyI&feature=youtu.be]
|-
|Week of Nov 16 || Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran || 8|| What Game Are We Playing? End-to-end Learning in Normal and Extensive Form Games || [https://arxiv.org/pdf/1805.02777.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=what_game_are_we_playing Summary] || [https://www.youtube.com/watch?v=9qJoVxo3hnI&feature=youtu.be]
|-
|Week of Nov 16 || Yuchuan Wu || 9|| || || ||
|-
|Week of Nov 16 || Zhou Zeping, Siqi Li, Yuqin Fang, Fu Rao || 10|| A survey of neural network-based cancer prediction models from microarray data || [https://www.sciencedirect.com/science/article/pii/S0933365717305067 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Y93fang Summary] || [https://youtu.be/B8pPUU8ypO0]
|-
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou Summary] || [https://www.youtube.com/watch?v=tvCEvvy54X8&ab_channel=JJLian]
|-
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Networks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat Summary] || [https://www.youtube.com/watch?v=or5RTxDnZDo]
|-
|Week of Nov 23 || Taohao Wang, Zeren Shen, Zihao Guo, Rui Chen || 13|| Large Scale Landmark Recognition via Deep Metric Learning || [https://arxiv.org/pdf/1908.10192.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang Summary] || [https://www.youtube.com/watch?v=K9NypDNPLJo Video]
|-
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data Summary] || [https://youtu.be/i_5PQdfuH-Y]
|-
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| Semantic Relation Classification via Convolution Neural Network|| [https://www.aclweb.org/anthology/S18-1127.pdf paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification——via_Convolution_Neural_Network Summary]|| [https://www.youtube.com/watch?v=m9o3NuMUKkU&ab_channel=DiMa video]
|-
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| Graph Structure of Neural Networks || [https://proceedings.icml.cc/paper/2020/file/757b505cfd34c64c85ca5b5690ee5293-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_Networks Summary] || [https://youtu.be/x9eUgEwntcs Video]
|-
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker Summary]|| [https://www.youtube.com/watch?v=kazqcOwbtTI Video]
|-
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Point-of-Interest_Recommendation:_Exploiting_Self-Attentive_Autoencoders_with_Neighbor-Aware_Influence Summary] || [https://www.youtube.com/watch?v=aAwjaos_Hus Video]
|-
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN Summary]|| [https://youtu.be/vOENmt9jgVE Video]
|-
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan Summary]|| [https://www.youtube.com/watch?v=48xYZXifjS0&ab_channel=SeakraChin]
|-
|Week of Nov 30 || Banno Dion, Battista Joseph, Kahn Solomon || 21|| Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks || [https://www.sciencedirect.com/science/article/pii/S1877050919310646] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Music_Recommender_System_Based_using_CRNN#Evaluation_of_Music_Recommendation_System: Summary] || [https://youtu.be/eGUV3zwLwqQ]
|-
|Week of Nov 30 || Isaac Ellmen, Dorsa Mohammadrezaei, Emilee Carson || 22|| A universal SNP and small-indel variant caller using deep neural networks||[https://www.nature.com/articles/nbt.4235.epdf?author_access_token=q4ZmzqvvcGBqTuKyKgYrQ9RgN0jAjWel9jnR3ZoTv0NuM3saQzpZk8yexjfPUhdFj4zyaA4Yvq0LWBoCYQ4B9vqPuv8e2HHy4vShDgEs8YxI_hLs9ov6Y1f_4fyS7kGZ Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_universal_SNP_and_small-indel_variant_caller_using_deep_neural_networks Summary] ||
|-
|Week of Nov 30 || Daniel Fagan, Cooper Brooke, Maya Perelman || 23|| Efficient kNN Classification With Different Number of Nearest Neighbors || [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7898482 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Dfagan Summary]|| [https://youtu.be/_STVyvm_Kao]
|-
|Week of Nov 30 || Karam Abuaisha, Evan Li, Jason Pu, Nicholas Vadivelu || 24|| Being Bayesian about Categorical Probability || [https://proceedings.icml.cc/static/paper_files/icml/2020/3560-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_Probability Summary] || [https://drive.google.com/file/d/1I0uYF2xEMuNVtaEhPb_vZ6bxSKMi0gUh/view?usp=sharing]
|-
|Week of Nov 30 || Anas Mahdi, Will Thibault, Jan Lau, Jiwon Yang || 25|| Loss Function Search for Face Recognition || [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Loss_Function_Search_for_Face_Recognition Summary] || [https://youtu.be/i3dXnK9HGSQ]
|-
|Week of Nov 30 ||Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee || 26|| Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms || [https://arxiv.org/pdf/1912.07618.pdf?fbclid=IwAR0RwATSn4CiT3qD9LuywYAbJVw8YB3nbex8Kl19OCExIa4jzWaUut3oVB0 Paper] || Summary [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&fbclid=IwAR1Tad2DAM7LT6NXXuSYDZtHHBvN0mjZtDdCOiUFFq_XwVcQxG3hU-3XcaE] || [https://www.youtube.com/watch?v=kiYbAmd_3IA]
|-
|Week of Nov 30 || Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim || 27|| Improving neural networks by preventing co-adaption of feature detectors || [https://arxiv.org/pdf/1207.0580.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectors Summary] || [https://youtu.be/SV5UOM3QwiA Video]
|-
|Week of Nov 30 || Yawen Wang, Danmeng Cui, ZiJie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Describtion_of_Text_Mining Summary] ||
|-
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN Summary] || [https://youtu.be/NgcSMXNDNuU]
|-
|Week of Nov 30 || Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classification systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Research_Papers_Classification_System Summary] || [https://youtu.be/Ug-5H4B5xkQ]
|-
|Week of Nov 30 || Daniel Zhang, Jacky Yao, Scholar Sun, Russell Parco, Ian Cheung || 31 || Speech2Face: Learning the Face Behind a Voice || [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Speech2Face:_Learning_the_Face_Behind_a_Voice Summary] || [https://youtu.be/lNQAbMxOj4w]
|-
|Week of Nov 30 || Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du || 32 || Evaluating Machine Accuracy on ImageNet || [https://proceedings.icml.cc/static/paper_files/icml/2020/6173-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Evaluating_Machine_Accuracy_on_ImageNet Summary] || [https://youtu.be/jj4S3VGzQz4 Video]
|-
|Week of Nov 30 || Mushi Wang, Siyuan Qiu, Yan Yu || 33 || Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections || [https://ieeexplore.ieee.org/abstract/document/8957421 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction Summary] || [https://youtu.be/cqyn3aO_5tc Video 33]

Loss Function Search for Face Recognition

2020-12-01T03:41:20Z

Y87yu: /* Critiques */

== Presented by ==
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang

== Introduction ==
Face recognition is a technology that can label a face to a specific identity. The process involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face and another face map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness.

Hence, the paper introduced a new loss function which can reduce the softmax probability. The softmax function maps the network outputs to a vector of class probabilities, each between 0 and 1, that sum to 1. Cross-entropy loss is the negative log of the probability assigned to the true class. When softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:

<center><math>L_1=-\log\frac{e^{w^T_y x}}{e^{w^T_y x} + \sum_{k \neq y}^K{e^{w^T_k x}}}</math> [1] </center>

Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:

<center><math>L_2=-\log\frac{e^{s \cos{(\theta_{w_y,x})}}}{e^{s \cos{(\theta_{w_y,x})}} + \sum_{k \neq y}^K{e^{s \cos{(\theta_{w_k,x})}}}}</math> [1] </center>

This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above. Some of these variations will be discussed in detail in the later sections.
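
The two losses above can be sketched in plain Python as follows. This is an illustrative sketch rather than the paper's implementation: the function names, the toy weights, and the default scale <code>s=30.0</code> are assumptions chosen for the example.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def neg_log_softmax(logits, y):
    """Return -log of the softmax probability assigned to index y."""
    m = max(logits)  # shift by the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[y]


def softmax_loss(W, x, y):
    """L_1 with logits w_k^T x; W is a list of K weight vectors."""
    return neg_log_softmax([dot(w, x) for w in W], y)


def cosine_softmax_loss(W, x, y, s=30.0):
    """L_2 with logits s * cos(theta_{w_k, x}); s=30.0 is an assumed scale."""
    x_norm = math.sqrt(dot(x, x))
    logits = [s * dot(w, x) / (math.sqrt(dot(w, w)) * x_norm) for w in W]
    return neg_log_softmax(logits, y)


# Hypothetical toy example: three classes, two-dimensional features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [2.0, 0.5]
l1 = softmax_loss(W, x, 0)
l2 = cosine_softmax_loss(W, x, 0)
l2_rescaled = cosine_softmax_loss(W, [10 * v for v in x], 0)
print(l1, l2, l2_rescaled)
```

Because <math>L_2</math> depends only on the angles between the feature and the class weights, rescaling the feature vector leaves it unchanged, whereas <math>L_1</math> changes with the feature magnitude.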

In this paper, the authors first identified that reducing the softmax probability is the key to feature discrimination and then designed two search approaches (random and reward-guided). They then evaluated their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms using nine popular face recognition benchmarks.

== Previous Work ==
Margin-based (angular, additive, additive angular margins) softmax loss functions are important in learning discriminative features in face recognition. Hand-crafted methods such as A-softmax, V-softmax, AM-Softmax, and Arc-softmax have been developed previously, but they require much effort to design. Li et al. proposed an AutoML method for loss function search, also known as AM-LFS, from a hyper-parameter optimization perspective [2]. It automatically determines the search space by leveraging reinforcement learning to search loss functions during the training process, though its drawback is a complex and unstable search space.

== Motivation ==
Previous algorithms for facial recognition frequently rely on CNNs that may include metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost for these functions was high. This drawback prompted the redesign of the classical softmax loss, which cannot discriminate features. Multiple softmax loss functions have since been developed, including margin-based formulations, but they often require fine-tuning of parameters and are susceptible to instability. Therefore, researchers need to put a lot of effort into creating their method in the large design space. AM-LFS takes an optimization approach for selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.

To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. They develop a margin-based softmax loss that requires only one parameter, together with an improved search space explored by a reward-based method, which allows them to determine the best option for their loss function.

== Problem Formulation ==
=== Analysis of Margin-based Softmax Loss ===
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:

<center><math>p_m=\frac{1}{ap+(1-a)} \cdot p</math></center>
<center> where <math>a=1-e^{s[\cos{(\theta_{w_y,x})}-f{(m,\theta_{w_y,x})}]}</math> and <math>a \leq 0</math></center>

<math>a</math> is considered a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the margin function (<math>f</math>), margin-based softmax losses work by reducing the softmax probability, which is what makes the learned features more discriminative.

Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, versus AM-LFS which has 2M parameters without constraints that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because it could be larger than the softmax probability.

=== Random Search ===
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:

<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center>

This formulation encourages a feature margin between different classes and therefore supports feature discrimination. The search space is thus defined as the choice of <math>h{(a,p)}</math>, whose impact on the training procedure is determined by the modulating factor <math>a</math>. To validate the unified formulation, the modulating factor is set randomly at each training epoch. This method is denoted Random-Softmax in the paper.

=== Reward-Guided Search ===
Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need labelled input/output pairs, and it does not need sub-optimal actions to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking actions that maximize the cumulative reward [3]. The RL process is shown in Figure 1. The cumulative reward is defined as:

<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center>

where <math>G_t</math> is the cumulative reward, <math>R_t</math> is the immediate reward at time <math>t</math>, and <math>R_T</math> is the reward at the end of the episode.

<math>G_t</math> is the sum of immediate rewards from an arbitrary time <math>t</math> until the end of the episode. It is a random variable because it depends on the immediate rewards, which in turn depend on the agent's actions and the environment's reaction to them.

<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center>
<center>Figure 1: Reinforcement Learning scenario [4]</center>

The reward function guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome: the reward is adjusted based on the feedback received when a task is completed [5].

In this paper, RL is used to learn the distribution <math>g(a;\mu,\sigma)</math> from which the modulating factor <math>a</math> of the softmax loss is sampled. The hyperparameter <math>\mu</math> is updated after each epoch using the reward function:

<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center>

=== Optimization ===
Calculating the reward involves a standard bi-level optimization problem over the sampled hyperparameters ({<math>a_1,a_2,…,a_B</math>}), in which one objective function is minimized while another is maximized:

<center><math>\max_a R(a)=r(M_{w^*(a)},S_v)</math></center>
<center><math>w^*(a)=\arg\min_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center>

In this case, the loss function is evaluated on the training set <math>S_t</math> and the reward function on the validation set <math>S_v</math>. The weights <math>w</math> are trained to minimize the loss, while the hyperparameter <math>a</math> is chosen to maximize the reward. The reward calculated for each model ({<math>M_{w_e^1},M_{w_e^2},…,M_{w_e^B}</math>}) yields a corresponding score, and the algorithm selects the model with the highest score. That model is used in the next epoch, and the process is repeated until training converges. In the end, the algorithm takes the model with the highest score without retraining.

== Results and Discussion ==
=== Data Preprocessing ===
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets. Furthermore, there were a total of 15,414 identities that overlapped between the testing and training datasets. These were removed from the training sets.

=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, CFP ===
For LFW, there is no noticeable difference between the algorithms proposed in this paper and the other algorithms; however, AM-Softmax achieved higher results than Search-Softmax, and Random-Softmax achieved the highest results by 0.03%.

Random-Softmax outperforms the baseline softmax and is comparable to most of the margin-based softmax losses. Search-Softmax boosts performance further and beats most methods; specifically, when trained on the CASIA-WebFace-R data set, it achieves a 0.72% average improvement over AM-Softmax. The model proposed in the paper gives better results because its optimization strategy helps boost discrimination power, and because candidates sampled from the proposed search space can approximate the margin-based loss functions well. The improvement on these test sets is small because they are relatively simple and the performance of all methods on them is near saturation, so tests with more complicated protocols are needed to evaluate performance further. The following table summarizes the performance of each model.

<center>Table 1. Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center>

<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center>

=== Results on RFW ===
The RFW dataset measures racial bias and consists of four subsets: Caucasian, Indian, Asian, and African. With RFW as the test set, Random-Softmax and Search-Softmax performed better than the other methods. Random-Softmax outperforms the baseline softmax by a large margin, which indicates that reducing the softmax probability enhances feature discrimination for face recognition. The reward-guided Search-Softmax method is also observed to be more likely to enhance discriminative feature learning, resulting in higher performance, as shown in Table 2 and Table 3.

<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center>
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center>

<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center>
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center>

=== Results on MegaFace and Trillion-Pairs ===
The different loss functions are tested again with more complicated protocols. Tables 4 and 5 report the identification (Id.) Rank-1 accuracy and the verification (Veri.) true positive rate (TPR) at a low false acceptance rate (FAR) of <math>1e-3</math> on MegaFace, as well as the identification TPR@FAR = <math>1e-6</math> and the verification TPR@FAR = <math>1e-9</math> on Trillion-Pairs.

On the test sets MegaFace and Trillion-Pairs, Search-Softmax achieves the best performance among all methods. On MegaFace, Search-Softmax beats the best competitor, AM-Softmax, by a large margin. It also outperforms AM-LFS thanks to the newly designed search space.

<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center>
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center>

<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center>
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center>

The CMC and ROC curves in Figure 2 show similar trends at other measures. The same trend holds on Trillion-Pairs, where Search-Softmax loss is superior, with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R for both identification and verification. Based on these experiments, Search-Softmax loss performs well, especially at a low false positive rate, and shows strong generalization ability for face recognition.

<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center>
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center>

== Conclusion ==
The paper argues that the key to enhancing feature discrimination for face recognition is knowing how to reduce the softmax probability. To achieve this goal, a unified formulation for the margin-based softmax losses is designed. Two search methods, one random and one reward-guided, were developed and validated to be effective against six other methods on nine different test data sets. While the developed methods were generally more accurate than previous methods, there is very little difference between the two; Search-Softmax performs slightly better than Random-Softmax most of the time.

== Critiques ==
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.
* The datasets required some preprocessing, which may have improved the results beyond what the method would otherwise achieve.
* AM-LFS was implemented by the authors for experimentation (the code was not made public), so the comparison may not be accurate.
* The test data sets used to evaluate Search-Softmax and Random-Softmax are simple, and other methods already saturate on them. The proposed methods therefore did not show many advantages, since all methods produce very similar results. A more complicated data set needs to be tested to prove the method's reliability.
* Another paper, Large-Margin Softmax Loss for Convolutional Neural Networks [https://arxiv.org/pdf/1612.02295.pdf], provides a more detailed explanation of how to reduce margin-based softmax loss.
* The accuracy on the testing sets is questionable, as the authors only used the cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c for training rather than the original training sets with noisy labels.
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper] by Tae-Hyun Oh et al., the authors also discuss an optimal loss function for face recognition. However, since that paper performs face recognition from voice audio, its loss function differs slightly from the ones discussed in this paper.
* This model has many applications, such as helping police identify disguised prisoners. However, good data preprocessing is needed to obtain good predictions, and the authors say little about data preprocessing, which is a key part of this model.
* It would be helpful to know what kind of noise was removed in the cleaned versions. Also, simply removing the overlapping data is wasteful; it would be better to keep them in either the training or the testing samples.
* The paper indicates that the new search method and loss function yield more effective face recognition results than the other six methods. However, there is no mention of any increase or decrease in computational efficiency, which matters because the differences between the methods are very small and real-time evaluation is often required in face recognition applications.

== References ==
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.

[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.

[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020].

[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].

[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].

Loss Function Search for Face Recognition

2020-12-01T03:40:10Z

Y87yu: /* Critiques */

== Presented by ==
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang

== Introduction ==
Face recognition is a technology that can label a face to a specific identity. The process involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face and another face map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness.

Hence, the paper introduces a new loss function that reduces the softmax probability. The softmax probability is the predicted probability for each class; together these form a vector of values between 0 and 1 that sum to 1. Cross-entropy loss is the negative log of the probability assigned to the true class. When softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:

<center><math>L_1=-\log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k≠y}^K{e^{w^T_kx}}}</math> [1] </center>

Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized to the cosine of the angle <math>\theta_{{w_y},x}</math> between the class weight <math>w_y</math> and the feature <math>x</math>, and <math>s</math> is a scale parameter that replaces the magnitude of <math>w^T_yx</math>:

<center><math>L_2=-\log\frac{e^{s \cos{(\theta_{{w_y},x})}}}{e^{s \cos{(\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_k},x})}}}}</math> [1] </center>

This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above. Some of these variations will be discussed in detail in the later sections.
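
As a minimal sketch of <math>L_2</math> (not the authors' code), the following pure-Python function computes the normalized softmax loss for a single sample. It assumes the cosine similarities between the feature and every class weight have already been computed; the function name and the scale value <code>s=32</code> are illustrative assumptions, not values taken from the paper.

```python
import math

def cosine_softmax_loss(cosines, y, s=32.0):
    """L_2 for one sample: cosines[k] = cos(theta_{w_k,x}), y = true class index."""
    logits = [s * c for c in cosines]
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[y]  # equals -log(softmax probability of class y)

# The loss is small when the true class has the largest cosine, large otherwise.
loss_good = cosine_softmax_loss([0.9, 0.1, -0.2], y=0)
loss_bad = cosine_softmax_loss([0.1, 0.9, -0.2], y=0)
```

The log-sum-exp form avoids overflow when <math>s</math> is large, which is a common implementation choice rather than something the summary specifies.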

In this paper, the authors first identified that reducing the softmax probability is key to feature discrimination and designed two search methods (random and reward-guided). They then evaluated their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms using nine popular face recognition benchmarks.

== Previous Work ==
Margin-based softmax loss functions (with angular, additive, or additive angular margins) are important for learning discriminative features in face recognition. Several hand-crafted methods have been developed, such as A-Softmax, V-Softmax, AM-Softmax, and Arc-Softmax, but designing them requires considerable effort. Li et al. proposed AM-LFS, an AutoML method for loss function search that approaches the problem from a hyper-parameter optimization perspective [2]. It uses reinforcement learning to search for loss functions during the training process, though its drawback is a complex and unstable search space.

== Motivation ==
Previous algorithms for facial recognition frequently rely on CNNs that may include metric learning loss functions such as contrastive loss or triplet loss. Without careful sample mining strategies, the computational cost of these functions is high. This drawback prompted the redesign of the classical softmax loss, which by itself cannot learn sufficiently discriminative features. Multiple softmax loss variants have since been developed, including margin-based formulations, but they often require fine-tuning of parameters and are susceptible to instability. Researchers therefore need to put considerable effort into designing a method within a large design space. AM-LFS takes an optimization approach to selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks stem from the lack of guidance in designing the search space.

To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors aim to reduce the softmax probability, which improves feature discrimination when using margin-based softmax loss functions. They develop a margin-based softmax loss that requires only one parameter, together with an improved search space and a reward-based search method that lets them determine the best option for their loss function.

== Problem Formulation ==
=== Analysis of Margin-based Softmax Loss ===
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:

<center><math>p_m=\frac{1}{ap+(1-a)}*p</math></center>
<center> where <math>a=1-e^{s[\cos{(\theta_{{w_y},x})}-f{(m,\theta_{{w_y},x})}]}</math> and <math>a≤0</math></center>

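<math>a</math> is considered a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Since <math>h{(a,p)} \le 1</math>, the margin-based probability <math>p_m</math> never exceeds the softmax probability <math>p</math>. Therefore, regardless of the margin function (<math>f</math>), the key to the success of margin-based softmax losses is reducing the softmax probability.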
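
The effect of the modulating function can be checked numerically. The sketch below, with illustrative function names that do not come from the paper, evaluates <math>h(a,p)</math> and <math>p_m=h(a,p) \cdot p</math> directly from the formulas above. Because the denominator equals <math>1-a(1-p)</math>, it is at least 1 whenever <math>a≤0</math> and <math>0≤p≤1</math>.

```python
def modulating_function(a, p):
    """h(a, p) = 1 / (a*p + (1 - a)); requires a <= 0 and 0 <= p <= 1."""
    if a > 0:
        raise ValueError("the modulating factor a must be <= 0")
    return 1.0 / (a * p + (1.0 - a))

def margin_probability(a, p):
    """p_m = h(a, p) * p, the reduced (margin-based) softmax probability."""
    return modulating_function(a, p) * p

# a = 0 leaves the softmax probability unchanged; a more negative a shrinks it.
unchanged = margin_probability(0.0, 0.7)
reduced = margin_probability(-1.0, 0.5)
```

With <math>a=-1</math> and <math>p=0.5</math> the denominator is 1.5, so <math>p_m=1/3</math>, which is below the original probability of 0.5.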

Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, versus AM-LFS which has 2M parameters without constraints that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because it could be larger than the softmax probability.

=== Random Search ===
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:

<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center>

This formulation encourages a feature margin between different classes and therefore supports feature discrimination. The search space is thus defined as the choice of <math>h{(a,p)}</math>, whose impact on the training procedure is determined by the modulating factor <math>a</math>. To validate the unified formulation, the modulating factor is set randomly at each training epoch. This method is denoted Random-Softmax in the paper.
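
A rough illustration of the Random-Softmax idea follows. It evaluates <math>L_5</math> for a fixed softmax probability while drawing a new modulating factor each epoch. The sampling range of [-5, 0], the seed, and the names are assumptions made for this sketch; the summary does not state how the authors sample <math>a</math>.

```python
import math
import random

def unified_loss(a, p):
    """L_5 = -log(h(a, p) * p) with h(a, p) = 1 / (a*p + (1 - a)) and a <= 0."""
    h = 1.0 / (a * p + (1.0 - a))
    return -math.log(h * p)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
p = 0.6                 # softmax probability of the true class (made-up value)
for epoch in range(3):
    a = -rng.uniform(0.0, 5.0)  # hypothetical range: a drawn from [-5, 0]
    print(epoch, round(a, 3), round(unified_loss(a, p), 4))
```

For any <math>a<0</math> the printed loss is larger than the plain softmax loss <math>-\log p</math>, which is what pushes the network toward a larger margin.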

=== Reward-Guided Search ===
Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need labelled input/output pairs, and it does not need sub-optimal actions to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking actions that maximize the cumulative reward [3]. The RL process is shown in Figure 1. The cumulative reward is defined as:

<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center>

where <math>G_t</math> is the cumulative reward, <math>R_t</math> is the immediate reward at time <math>t</math>, and <math>R_T</math> is the reward at the end of the episode.

<math>G_t</math> is the sum of immediate rewards from an arbitrary time <math>t</math> until the end of the episode. It is a random variable because it depends on the immediate rewards, which in turn depend on the agent's actions and the environment's reaction to them.
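
As a small worked example of the cumulative reward, the sketch below computes <math>G_t</math> for every time step of one episode by summing the rewards backwards. The function name and the reward values are made up for illustration.

```python
def returns(rewards):
    """Return [G_0, ..., G_T] where G_t = R_t + R_{t+1} + ... + R_T (undiscounted)."""
    out = []
    running = 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode
        running += r
        out.append(running)
    out.reverse()
    return out

g = returns([1.0, 0.0, 2.0])  # G_0 = 3.0, G_1 = 2.0, G_2 = 2.0
```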

<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center>
<center>Figure 1: Reinforcement Learning scenario [4]</center>

The reward function guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome: the reward is adjusted based on the feedback received when a task is completed [5].

In this paper, RL is used to learn the distribution <math>g(a;\mu,\sigma)</math> from which the modulating factor <math>a</math> of the softmax loss is sampled. The hyperparameter <math>\mu</math> is updated after each epoch using the reward function:

<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center>
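
The update above can be sketched in a few lines under two assumptions that the summary does not confirm: that <math>g</math> is a Gaussian density with mean <math>\mu</math> and standard deviation <math>\sigma</math>, and that the gradient of its logarithm is taken with respect to <math>\mu</math>, which for a Gaussian equals <math>(a_i-\mu)/\sigma^2</math>. The function name and the numbers are illustrative.

```python
def update_mu(mu, sigma, samples, rewards, lr):
    """One step: mu + lr * (1/B) * sum_i R(a_i) * d/dmu log g(a_i; mu, sigma)."""
    B = len(samples)
    # For a Gaussian, d/dmu log g(a; mu, sigma) = (a - mu) / sigma**2 (assumption).
    grad = sum(r * (a - mu) / sigma ** 2 for a, r in zip(samples, rewards)) / B
    return mu + lr * grad

# Two samples around mu = 0; only the sample at -1 earns a reward,
# so the mean moves toward -1.
new_mu = update_mu(0.0, 1.0, [-1.0, 1.0], [1.0, 0.0], lr=0.1)
```

Here the gradient is (1 · (-1) + 0 · 1) / 2 = -0.5, so the new mean is -0.05: the distribution shifts toward the value of <math>a</math> that produced the higher reward.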

=== Optimization ===
Calculating the reward involves a standard bi-level optimization problem over the sampled hyperparameters ({<math>a_1,a_2,…,a_B</math>}), in which one objective function is minimized while another is maximized:

<center><math>\max_a R(a)=r(M_{w^*(a)},S_v)</math></center>
<center><math>w^*(a)=\arg\min_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center>

In this case, the loss function is evaluated on the training set <math>S_t</math> and the reward function on the validation set <math>S_v</math>. The weights <math>w</math> are trained to minimize the loss, while the hyperparameter <math>a</math> is chosen to maximize the reward. The reward calculated for each model ({<math>M_{w_e^1},M_{w_e^2},…,M_{w_e^B}</math>}) yields a corresponding score, and the algorithm selects the model with the highest score. That model is used in the next epoch, and the process is repeated until training converges. In the end, the algorithm takes the model with the highest score without retraining.
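
The selection step described above can be outlined as follows. This is only a schematic: <code>train_one_epoch</code> and <code>reward</code> are hypothetical callables standing in for training on <math>S_t</math> and scoring on <math>S_v</math>, and the toy versions used here (a "model" that simply stores its <math>a</math>, and a reward that peaks at <math>a=-2</math>) are invented so the sketch runs on its own.

```python
def select_best(weights, samples, train_one_epoch, reward):
    """Train one candidate per sampled a, score each, and keep the best model."""
    models = [train_one_epoch(weights, a) for a in samples]  # inner problem on S_t
    rewards = [reward(m) for m in models]                    # outer objective on S_v
    best = max(range(len(samples)), key=lambda i: rewards[i])
    return models[best], rewards, best

# Toy stand-ins (assumptions, not the paper's training or evaluation code).
toy_train = lambda w, a: a              # the toy "model" just remembers its a
toy_reward = lambda m: -(m + 2.0) ** 2  # toy validation score, highest at a = -2

best_model, rewards, best_index = select_best(None, [-3.0, -2.0, -0.5], toy_train, toy_reward)
```

In a full loop, the returned model would seed the next epoch and the rewards would feed the update of <math>\mu</math>.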

== Results and Discussion ==
=== Data Preprocessing ===
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets. Furthermore, there were a total of 15,414 identities that overlapped between the testing and training datasets. These were removed from the training sets.

=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, CFP ===
For LFW, there is no noticeable difference between the algorithms proposed in this paper and the other algorithms; however, AM-Softmax achieved higher results than Search-Softmax, and Random-Softmax achieved the highest results by 0.03%.

Random-Softmax outperforms the baseline softmax and is comparable to most of the margin-based softmax losses. Search-Softmax boosts performance further and beats most methods; specifically, when trained on the CASIA-WebFace-R data set, it achieves a 0.72% average improvement over AM-Softmax. The model proposed in the paper gives better results because its optimization strategy helps boost discrimination power, and because candidates sampled from the proposed search space can approximate the margin-based loss functions well. The improvement on these test sets is small because they are relatively simple and the performance of all methods on them is near saturation, so tests with more complicated protocols are needed to evaluate performance further. The following table summarizes the performance of each model.

<center>Table 1. Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center>

<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center>

=== Results on RFW ===
The RFW dataset measures racial bias and consists of four subsets: Caucasian, Indian, Asian, and African. With RFW as the test set, Random-Softmax and Search-Softmax performed better than the other methods. Random-Softmax outperforms the baseline softmax by a large margin, which indicates that reducing the softmax probability enhances feature discrimination for face recognition. The reward-guided Search-Softmax method is also observed to be more likely to enhance discriminative feature learning, resulting in higher performance, as shown in Table 2 and Table 3.

<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center>
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center>

<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center>
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center>

=== Results on MegaFace and Trillion-Pairs ===
The different loss functions are tested again with more complicated protocols. Tables 4 and 5 report the identification (Id.) Rank-1 accuracy and the verification (Veri.) true positive rate (TPR) at a low false acceptance rate (FAR) of <math>1e-3</math> on MegaFace, as well as the identification TPR@FAR = <math>1e-6</math> and the verification TPR@FAR = <math>1e-9</math> on Trillion-Pairs.

On the test sets MegaFace and Trillion-Pairs, Search-Softmax achieves the best performance among all methods. On MegaFace, Search-Softmax beats the best competitor, AM-Softmax, by a large margin. It also outperforms AM-LFS thanks to the newly designed search space.

<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center>
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center>

<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center>
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center>

The CMC and ROC curves in Figure 2 show similar trends at other measures. The same trend holds on Trillion-Pairs, where Search-Softmax loss is superior, with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R for both identification and verification. Based on these experiments, Search-Softmax loss performs well, especially at a low false positive rate, and shows strong generalization ability for face recognition.

<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center>
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center>

== Conclusion ==
The paper argues that the key to enhancing feature discrimination for face recognition is knowing how to reduce the softmax probability. To achieve this goal, a unified formulation for the margin-based softmax losses is designed. Two search methods, one random and one reward-guided, were developed and validated to be effective against six other methods on nine different test data sets. While the developed methods were generally more accurate than previous methods, there is very little difference between the two; Search-Softmax performs slightly better than Random-Softmax most of the time.

== Critiques ==
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.
* The datasets required some preprocessing, which may have improved the results beyond what the method would otherwise achieve.
* AM-LFS was implemented by the authors for experimentation (the code was not made public), so the comparison may not be accurate.
* The test data sets used to evaluate Search-Softmax and Random-Softmax are simple, and other methods already saturate on them. The proposed methods therefore did not show many advantages, since all methods produce very similar results. A more complicated data set needs to be tested to prove the method's reliability.
* Another paper, Large-Margin Softmax Loss for Convolutional Neural Networks [https://arxiv.org/pdf/1612.02295.pdf], provides a more detailed explanation of how to reduce margin-based softmax loss.
* The accuracy on the testing sets is questionable, as the authors only used the cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c for training rather than the original training sets with noisy labels.
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper] by Tae-Hyun Oh et al., the authors also discuss an optimal loss function for face recognition. However, since that paper performs face recognition from voice audio, its loss function differs slightly from the ones discussed in this paper.
* This model has many applications, such as helping police identify disguised prisoners. However, good data preprocessing is needed to obtain good predictions, and the authors say little about data preprocessing, which is a key part of this model.
* It would be helpful to know what kind of noise was removed in the cleaned versions. Also, simply removing the overlapping data is wasteful; it would be better to keep them in either the training or the testing samples.
* The paper indicates that the new search method and loss function yield more effective face recognition results than the other six methods. However, there is no mention of any increase or decrease in computational efficiency, which matters because the differences between the methods are very small and real-time evaluation is often required in face recognition applications.

== References ==
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.

[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.

[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020].

[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].

[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].

Loss Function Search for Face Recognition

2020-12-01T03:31:52Z

Y87yu: /* Results on MegaFace and Trillion-Pairs */

== Presented by ==
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang

== Introduction ==
Face recognition is a technology that can label a face to a specific identity. The process involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face and another face map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness.

Hence, the paper introduces a new loss function that reduces the softmax probability. The softmax probability is the predicted probability for each class; together these form a vector of values between 0 and 1 that sum to 1. Cross-entropy loss is the negative log of the probability assigned to the true class. When softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:

<center><math>L_1=-\log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k \neq y}^K{e^{w^T_kx}}}</math> [1] </center>

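Specifically for face recognition, <math>L_1</math> is modified such that the weights <math>w_k</math> and the feature <math>x</math> are normalized, so each inner product becomes the cosine of the angle <math>\theta_{w_k,x}</math> between them, and <math>s</math> is a scale parameter that sets the magnitude: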

<center><math>L_2=-\log\frac{e^{s \cos{(\theta_{w_y,x})}}}{e^{s \cos{(\theta_{w_y,x})}} + \sum_{k \neq y}^K{e^{s \cos{(\theta_{w_k,x})}}}}</math> [1] </center>

This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above. Some of these variations will be discussed in detail in the later sections.

In this paper, the authors first identify that reducing the softmax probability is key to feature discrimination and design two search methods (random and reward-guided). They then evaluate the resulting Random-Softmax and Search-Softmax approaches against other face recognition algorithms on nine popular face recognition benchmarks.

== Previous Work ==
Margin-based (angular, additive, and additive angular margin) softmax loss functions are important for learning discriminative features in face recognition. Previously developed hand-crafted methods, such as A-Softmax, V-Softmax, AM-Softmax, and Arc-Softmax, require considerable design effort. Li et al. proposed AM-LFS, an AutoML method for loss function search, from a hyper-parameter optimization perspective [2]. It leverages reinforcement learning to search over loss functions automatically during the training process, though its drawback is a complex and unstable search space.

== Motivation ==
Previous algorithms for facial recognition frequently rely on CNNs that may include metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost of these functions is high. This drawback prompted the redesign of the classical softmax loss, which by itself cannot learn discriminative features. Multiple softmax loss variants have since been developed, including margin-based formulations, but they often require fine-tuning of parameters and are susceptible to instability. Therefore, researchers need to put a lot of effort into designing their method within a large design space. AM-LFS takes an optimization approach to selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.

To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. They develop a unified margin-based softmax loss that requires only one parameter, together with an improved search space explored by a reward-guided method, which allows them to determine the best option for their loss function.

== Problem Formulation ==
=== Analysis of Margin-based Softmax Loss ===
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:

<center><math>p_m=\frac{1}{ap+(1-a)} \cdot p</math></center>
<center> where <math>a=1-e^{s{[\cos{(\theta_{w_y,x})}-f{(m,\theta_{w_y,x})}]}}</math> and <math>a \leq 0</math></center>

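<math>a</math> is considered a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Because <math>h \in (0,1]</math>, the margin-based probability <math>p_m</math> never exceeds the softmax probability <math>p</math>. Therefore, regardless of the margin function (<math>f</math>), reducing the softmax probability is the key to enhancing feature discrimination.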
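
The relationship between <math>p</math>, <math>a</math>, and <math>p_m</math> can be checked numerically. The following is a minimal sketch, not the authors' code: the scale <math>s=32</math>, the angle, and the additive margin <math>f(m,\theta)=\cos\theta-m</math> (AM-Softmax style) are illustrative assumptions.

```python
import math

def modulating_factor(s, cos_theta, margin_cos):
    """a = 1 - exp(s * (cos(theta) - f(m, theta))); a <= 0 whenever f <= cos(theta)."""
    return 1.0 - math.exp(s * (cos_theta - margin_cos))

def margin_probability(p, a):
    """p_m = h(a, p) * p with h(a, p) = 1 / (a * p + (1 - a))."""
    return p / (a * p + (1.0 - a))

# Hypothetical values chosen for illustration only:
# scale s = 32, cos(theta) = 0.7, additive margin m = 0.05.
a = modulating_factor(s=32.0, cos_theta=0.7, margin_cos=0.7 - 0.05)
p = 0.9                      # softmax probability of the true class
p_m = margin_probability(p, a)
print(a, p_m)                # a is negative, p_m is smaller than p
```

With <math>a=0</math> the modulating function equals 1 and the margin-based probability reduces to the plain softmax probability.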

Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, whereas AM-LFS has 2M unconstrained parameters that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because their output could be larger than the softmax probability.

=== Random Search ===
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:

<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center>

This encourages a feature margin between different classes and therefore supports feature discrimination. The search space is then defined as the choice of <math>h{(a,p)}</math>, whose impact on the training procedure is decided by the modulating factor <math>a</math>. In order to validate the unified formulation, the modulating factor is randomly set at each training epoch. This is denoted Random-Softmax in this paper.
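
As a rough illustration of Random-Softmax, the sketch below evaluates <math>L_5</math> for a fixed softmax probability while drawing a new <math>a \leq 0</math> each epoch. The sampling range for <math>a</math> and the probability value are assumptions made for this example; the summary does not state the range the authors used.

```python
import math
import random

def unified_loss(p, a):
    """L_5 = -log(h(a, p) * p) with h(a, p) = 1 / (a * p + (1 - a)) and a <= 0."""
    h = 1.0 / (a * p + (1.0 - a))
    return -math.log(h * p)

rng = random.Random(0)
p = 0.8  # softmax probability of the true class (illustrative)
for epoch in range(3):
    a = -rng.uniform(0.0, 5.0)  # assumed sampling range; keeps a <= 0
    print(epoch, round(a, 3), round(unified_loss(p, a), 4))
```

A more negative <math>a</math> shrinks <math>h(a,p)</math>, so the same prediction incurs a larger loss than under the plain softmax loss (<math>a=0</math>).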

=== Reward-Guided Search ===
Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need labelled input/output pairs, and it does not need sub-optimal actions to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking actions that maximize the notion of cumulative reward [3]. The process of RL is shown in Figure 1. The equation of the cumulative reward function is:

<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center>

where <math>G_t</math> is the cumulative reward, <math>R_t</math> is the immediate reward at time <math>t</math>, and <math>R_T</math> is the reward at the final step <math>T</math> of the episode.

<math>G_t</math> is the sum of immediate rewards from an arbitrary time <math>t</math> onward. It is a random variable because it depends on the immediate rewards, which depend on the agent's actions and the environment's reaction to them.
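
For a finished episode, the cumulative reward is simply a suffix sum of the observed rewards. A minimal sketch with made-up reward values:

```python
def cumulative_reward(rewards, t):
    """G_t = R_t + R_{t+1} + ... + R_T for a finished episode (t is 0-indexed)."""
    return sum(rewards[t:])

episode = [1.0, 0.0, -0.5, 2.0]  # illustrative immediate rewards R_0 .. R_T
print(cumulative_reward(episode, 0))
print(cumulative_reward(episode, 2))
```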

<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center>
<center>Figure 1: Reinforcement Learning scenario [4]</center>

The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome: the reward is adjusted based on the feedback received when a task is completed [5].

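In this paper, RL is used to learn the distribution <math>g(a;\mu,\sigma)</math> from which the hyperparameter <math>a</math> of the softmax loss is sampled. The mean <math>\mu</math> of this distribution is updated after each epoch using the reward function, where <math>\eta</math> is a step size and <math>B</math> is the number of sampled hyperparameters per epoch: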

<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center>
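
The update can be sketched as a single policy-gradient step. This is an illustrative sketch under assumptions, not the authors' implementation: it assumes <math>g</math> is a Gaussian density and takes the gradient of <math>\log g</math> with respect to the mean, which gives <math>(a_i-\mu)/\sigma^2</math>. The function name and all numeric values are hypothetical.

```python
def update_mu(mu, sigma, samples, rewards, lr):
    """One policy-gradient step on the mean of a Gaussian g(a; mu, sigma).

    Assumption: the gradient of log g is taken with respect to mu,
    i.e. d/d(mu) log g(a; mu, sigma) = (a - mu) / sigma**2.
    """
    grads = [r * (a - mu) / sigma**2 for a, r in zip(samples, rewards)]
    return mu + lr * sum(grads) / len(samples)

# Two sampled hyperparameters (B = 2); only the first one earned a reward.
new_mu = update_mu(mu=-1.0, sigma=1.0, samples=[-2.0, -0.5], rewards=[1.0, 0.0], lr=0.1)
print(new_mu)  # the mean moves toward the rewarded sample a = -2.0
```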

=== Optimization ===
Calculating the reward involves a standard bi-level optimization problem: the hyperparameters ({<math>a_1,a_2,…,a_B</math>}) sampled in each epoch are chosen to maximize a reward on the validation set, while the model weights are trained to minimize the loss that each hyperparameter defines:

<center><math>\max_a R(a)=r(M_{w^*(a)},S_v)</math></center>
<center><math>w^*(a)=\arg\min_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center>

In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The reward calculated for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) gives its score, and the algorithm selects the model with the highest score. That model is used in the next epoch, and the process is repeated until training converges. In the end, the algorithm takes the model with the highest score without retraining.

== Results and Discussion ==
=== Data Preprocessing ===
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets. Furthermore, there were a total of 15,414 identities that overlapped between the testing and training datasets. These were removed from the training sets.

=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, CFP ===
For LFW, there is no noticeable difference between the algorithms proposed in this paper and the other algorithms; however, AM-Softmax achieved higher results than Search-Softmax. Random-Softmax achieved the highest results by 0.03%.

Random-Softmax outperforms the baseline softmax and is comparable to most of the margin-based softmax losses. Search-Softmax boosts the performance further and betters most methods; specifically, when training on the CASIA-WebFace-R data set, it achieves a 0.72% average improvement over AM-Softmax. The model proposed by the paper gives better results because its optimization strategy helps boost the discrimination power. Also, the candidates sampled from the paper’s proposed search space can approximate the margin-based loss functions well. Little improvement is shown on these test sets because they are relatively simple and the performance of all the methods on them is near saturation, so more complicated protocols are needed to test the performance further. The following table gives a summary of the performance of each model.

<center>Table 1. Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center>

<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center>

=== Results on RFW ===
The RFW dataset measures racial bias and consists of four subsets: Caucasian, Indian, Asian, and African. Using this as the test set, Random-Softmax and Search-Softmax performed better than the other methods. Random-Softmax outperforms the baseline softmax by a large margin, which suggests that reducing the softmax probability enhances feature discrimination for face recognition. It is also observed that the reward-guided Search-Softmax method is more likely to enhance discriminative feature learning, resulting in higher performance, as shown in Table 2 and Table 3.

<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center>
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center>

<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center>
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center>

=== Results on MegaFace and Trillion-Pairs ===
The different loss functions are tested again with more complicated protocols. On MegaFace, the identification (Id.) Rank-1 accuracy and the verification (Veri.) true positive rate (TPR) at a low false acceptance rate (FAR) of <math>1e-3</math> are reported. On Trillion-Pairs, the identification TPR@FAR = <math>1e-6</math> and the verification TPR@FAR = <math>1e-9</math> are reported. The results are shown in Tables 4 and 5.

On the test sets MegaFace and Trillion-Pairs, Search-Softmax achieves the best performance among all compared methods. On MegaFace, Search-Softmax beats the best competitor, AM-Softmax, by a large margin. It also outperforms AM-LFS owing to the newly designed search space.

<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center>
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center>

<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center>
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center>

The CMC curves and ROC curves in Figure 2 show similar trends at other measures. The same trend holds on Trillion-Pairs, where Search-Softmax loss is superior, with a 4% improvement with CASIA-WebFace-R and a 1% improvement with MS-Celeb-1M-v1c-R on both identification and verification. Based on these experiments, Search-Softmax loss performs well, especially at a low false positive rate, and it shows a strong generalization ability for face recognition.

<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center>
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center>

== Conclusion ==
This paper argues that reducing the softmax probability is key to enhancing feature discrimination for face recognition. To achieve this goal, a unified formulation for the margin-based softmax losses is designed. Two search methods, one random and one reward-guided, are developed on top of it, and they are validated to be effective against six other methods on nine different test data sets. While the developed methods are generally more accurate than previous methods, there is very little difference between the two: Search-Softmax performs slightly better than Random-Softmax most of the time.

== Critiques ==
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.
* Datasets used did require some preprocessing, which may have improved the results beyond what the method otherwise would.
* AM-LFS was re-implemented by the authors for experimentation (its code was not made public), so the comparison may not be accurate.
* The test data sets used to evaluate Search-Softmax and Random-Softmax are simple, and other methods already saturate on them. As a result, the proposed methods show little advantage, since all methods produce very similar results. More complicated data sets need to be tested to prove the method's reliability.
* There is another paper, Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf], that provides a more detailed explanation of margin-based softmax loss.
* The accuracy reported on the testing sets is questionable, as the authors only used the cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c for training instead of the original training sets with noisy labels.
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], Tae-Hyun Oh et al. also discuss an optimal loss function for face recognition. However, since that paper performs face recognition from voice audio, the loss function used is slightly different from the ones discussed in this paper.
* This model has many applications, such as helping police identify disguised prisoners. However, good data preprocessing is needed, otherwise the predictions may be poor, and the authors say little about the data preprocessing, which is a key part of this model.
* It would be better to know what kinds of noise were removed in the cleaned versions. Also, simply removing the overlapping data is wasteful. It would be better to keep each overlapping identity in only one of the training and test sets.

== References ==
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.

[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.

[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020].

[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].

[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].

Research Papers Classification System

2020-12-01T03:28:02Z

Y87yu: /* Critique */

= Presented by =
Jill Wang, Junyi (Jay) Yang, Yu Min (Chris) Wu, Chun Kit (Calvin) Li

= Introduction =
This paper introduces a paper classification system that utilizes Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and K-means clustering. The most important technology the system uses to process big data is the Hadoop Distributed File System (HDFS). The system can handle quantitatively complex research paper classification problems efficiently and accurately.

===General Framework===

The paper classification system classifies research papers based on the abstracts given that the core of most papers is presented in the abstracts.

<ol><li>Paper Crawling
<p>Collects abstracts from research papers published during a given period</p></li>
<li>Preprocessing
<p> <ol style="list-style-type:lower-alpha"><li>Removes stop words from the crawled abstracts and extracts only the nouns</li>
<li>Generates a keyword dictionary, keeping only the top-N keywords with the highest frequencies</li> </ol>
</p></li>
<li>Topic Modelling
<p> Uses LDA to group the keywords into topics</p>
</li>
<li>Paper Length Calculation
<p> Calculates the total number of word occurrences in each abstract using the map-reduce algorithm, to prevent unbalanced TF values caused by the varying lengths of abstracts</p>
</li>
<li>Word Frequency Calculation
<p> Calculates the Term Frequency (TF) values which represent the frequency of keywords in a research paper</p>
</li>
<li>Document Frequency Calculation
<p> Calculates the Document Frequency (DF) values, which represent the frequency of keywords in a collection of research papers. The higher the DF value, the lower the importance of a keyword.</p>
</li>
<li>TF-IDF calculation
<p> Calculates the inverse of the DF (IDF), which represents the importance of a keyword, and multiplies it by the TF value to obtain TF-IDF.</p>
</li>
<li>Paper Classification
<p> Classifies papers by topics using the K-means clustering algorithm.</p>
</li>
</ol>

===Technologies===

A Hadoop cluster running HDFS, composed of one master node, one sub node, and four data nodes, is used to process the massive paper data. The TF-IDF calculation is implemented in Java on Hadoop 2.6.5. Spark MLlib is used to perform the LDA, and the Scikit-learn library is used to perform the K-means clustering.

===HDFS===

The Hadoop Distributed File System was used to process big data in this system. Hadoop breaks a big collection of data into partitions and passes each partition to an individual processor. Each processor only has information about the partition of data it received.

'''In this summary, we focus on introducing the main algorithms this system uses, namely LDA, TF-IDF, and K-Means.'''

=Data Preprocessing=
===Crawling of Abstract Data===

Under the assumption that audiences tend to first read the abstract of a paper to gain an overall understanding of the material, it is reasonable to assume the abstract section includes “core words” that can be used to effectively classify a paper's subject.

An abstract is crawled and its stop words are removed. Stop words are words that are usually ignored by search engines, such as “the” and “a”. Afterwards, nouns are extracted as a more condensed representation for efficient analysis.

This is managed on HDFS. The TF-IDF value of each paper is calculated through map-reduce.
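
The stop-word step can be illustrated with a few lines of Python. This is only a sketch: the stop-word list below is a tiny hand-picked sample, and the noun-extraction step is omitted because it would require a part-of-speech tagger.

```python
import re

# A tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}

def remove_stop_words(abstract):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(remove_stop_words("The system classifies a paper in the cloud."))
```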

===Managing Paper Data===

To construct an effective keyword dictionary using abstract data and keywords data in all of the crawled papers, the authors categorized keywords with similar meanings under a single representative keyword. This approach is called stemming, which is common in data cleaning. 1394 keyword categories are extracted, which is still too many to compute with. Hence, only the top 30 keyword categories are used.

<div align="center">[[File:table_1_kswf.JPG|700px]]</div>

=Topic Modeling Using LDA=

Latent Dirichlet allocation (LDA) is a generative probabilistic model that views documents as random mixtures over latent topics. Each topic is a distribution over words, and the goal is to extract these topics from documents.

LDA estimates the topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> using Dirichlet priors for the distributions with a fixed number of topics. For each document, a feature vector is obtained:

\[F = \left( P\left(z_1 | d\right), P\left(z_2 | d\right), \cdots, P\left(z_k | d\right) \right)\]

In the paper, the authors extract topics from the preprocessed papers to generate three topic sets, with 10, 20, and 30 topics respectively. The following table shows the highest-frequency keywords for the topic set with 10 topics.

<div align="center">[[File:table_2_tswtebls.JPG|700px]]</div>

===LDA Intuition===

LDA places Dirichlet priors on its distributions. The following picture illustrates Dirichlet distributions on a 2-simplex with different alpha values, one for each corner of the triangles.

<div align="center">[[File:dirichlet_dist.png|700px]]</div>

A simplex is a generalization of the notion of a triangle. In a Dirichlet distribution, each parameter is represented by a corner of the simplex, so adding parameters increases the dimension of the simplex. As illustrated, when the alphas are smaller than 1, the distribution is dense at the corners. When the alphas are greater than 1, the distribution is dense at the center.
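
The effect of alpha can be reproduced by sampling. The sketch below draws symmetric Dirichlet samples on the 2-simplex using the standard construction of normalizing independent Gamma draws; the alpha values, sample count, and helper names are illustrative choices. A large average "largest coordinate" means samples sit near corners, while a value close to 1/3 means they sit near the center.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one point on the simplex by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_component(alpha, n=2000, seed=0):
    """Average of the largest coordinate over n samples from Dirichlet(alpha, alpha, alpha)."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet([alpha] * 3, rng)) for _ in range(n)) / n

print(mean_largest_component(0.2))   # alphas < 1: mass near the corners
print(mean_largest_component(10.0))  # alphas > 1: mass near the center
```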

The following illustration shows an example LDA with 3 topics, 4 words and 7 documents.

<div align="center">[[File:LDA_example.png|800px]]</div>

In the left diagram, there are three topics, hence it is a 2-simplex. In the right diagram, there are four words, hence it is a 3-simplex. LDA essentially adjusts the parameters of the Dirichlet distributions and multinomial distributions (represented by the points) such that, in the left diagram, all the yellow points representing documents and, in the right diagram, all the points representing topics are as close to a corner as possible. In other words, LDA finds topics for documents and also finds words for topics. In the end, the topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> are produced.

=Term Frequency Inverse Document Frequency (TF-IDF) Calculation=

TF-IDF is widely used to evaluate the importance of a set of words in the fields of information retrieval and text mining. It is a combination of term frequency (TF) and inverse document frequency (IDF). The idea behind this combination is:
<ul>
<li>It evaluates the importance of a word within a document, and</li>
<li>It evaluates the importance of the word among the collection of all documents</li>
</ul>

The TF-IDF formula has the following form:

\[TF-IDF_{i,j} = TF_{i,j} \times IDF_{i}\]

where i stands for the <math>i^{th}</math> word and j stands for the <math>j^{th}</math> document.

===Term Frequency (TF)===

TF evaluates the proportion of a document that is made up of a given word. Thus, the TF value indicates the importance of a word within the document: the higher the TF, the more important the word.

In this paper, we only calculate TF for words in the keyword dictionary obtained. For a given keyword i, <math>TF_{i,j}</math> is the number of times word i appears in document j divided by the total number of words in document j.

The formula for TF has the following form:

\[TF_{i,j} = \frac{n_{i,j} }{\sum_k n_{k,j} }\]

where i stands for the <math>i^{th}</math> word, j stands for the <math>j^{th}</math> document, and <math>n_{i,j}</math> stands for the number of times word i appears in document j.

Note that the denominator is the total number of words remaining in document j after crawling.
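
A minimal sketch of the TF formula for a single document, assuming the document has already been reduced to a non-empty list of tokens. The token list and keyword dictionary are made up for illustration.

```python
from collections import Counter

def term_frequencies(tokens, keywords):
    """TF_{i,j} = n_{i,j} / sum_k n_{k,j}, computed only for dictionary keywords."""
    counts = Counter(tokens)
    total = len(tokens)  # all words remaining in the document
    return {kw: counts[kw] / total for kw in keywords}

doc = ["cloud", "grid", "cloud", "scheduling"]
tf = term_frequencies(doc, ["cloud", "grid", "network"])
print(tf)
```

Note that the denominator counts every remaining word in the document, not only the dictionary keywords, which matches the note above.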

===Document Frequency (DF)===

DF evaluates the percentage of documents that contain a given word over the entire collection of documents. Thus, the higher DF value is, the less important the word is. Since DF and the importance of the word have an inverse relation, we use IDF instead of DF.

<math>DF_{i}</math> is the number of documents in the collection with word i divided by the total number of documents in the collection. The formula for DF has the following form:

\[DF_{i} = \frac{|\{d_k \in D: n_{i,k} > 0\}|}{|D|}\]

where <math>n_{i,k}</math> is the number of times word i appears in document k, |D| is the total number of documents in the collection.

===Inverse Document Frequency (IDF)===

In this paper, IDF is calculated on a log scale, since the system receives a large number of documents, i.e., |D| is large.

The formula for IDF has the following form:

\[IDF_{i} = log\left(\frac{|D|}{|\{d_k \in D: n_{i,k} > 0\}|}\right)\]

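As mentioned before, the data is stored on HDFS and split into partitions, so a partition may contain no document with a given word, which would make the denominator zero. To avoid this, 1 is added to both the numerator and the denominator. The actual formula applied is: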

\[IDF_{i} = log\left(\frac{|D|+1}{|\{d_k \in D: n_{i,k} > 0\}|+1}\right)\]
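
The DF and smoothed IDF formulas can be sketched as follows. The three-document collection is made up, and the natural logarithm is an assumption, since the base of the log is not stated in this summary.

```python
import math

def document_frequency(word, docs):
    """DF_i: fraction of documents that contain the word at least once."""
    return sum(1 for doc in docs if word in doc) / len(docs)

def smoothed_idf(word, docs):
    """IDF_i = log((|D| + 1) / (|{d : n_i > 0}| + 1)); defined even if no document has the word."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log((len(docs) + 1) / (containing + 1))

docs = [["cloud", "grid"], ["cloud"], ["network"]]
print(document_frequency("cloud", docs))
print(smoothed_idf("cloud", docs))
print(smoothed_idf("absent", docs))  # no division by zero for an unseen word
```

Multiplying the TF value of a keyword in a document by its IDF then gives the TF-IDF value.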

=Paper Classification Using K-means Clustering=

K-means clustering is an unsupervised algorithm that groups similar data into the same cluster. It is an efficient and simple method that can work with different types of data attributes and is able to handle noise and outliers.
<br>

Given a <math>d</math> by <math>n</math> dataset <math>\mathbf{X} = \left[ \mathbf{x}_1 \cdots \mathbf{x}_n \right]</math>, the algorithm assigns each <math>\mathbf{x}_j</math> to one of <math>k</math> clusters based on the characteristics of <math>\mathbf{x}_j</math> itself.
<br>

Moreover, when assigning data into a cluster, the algorithm will also try to minimise the distances between the data and the centre of the cluster which the data belongs to. That is, k-means clustering will minimise the sum of square error:

\begin{align*}
\min \sum_{i=1}^{k} \sum_{j \in C_i} ||x_j - \mu_i||^2
\end{align*}

where
<ul>
<li><math>k</math>: the number of clusters</li>
<li><math>C_i</math>: the <math>i^{th}</math> cluster</li>
<li><math>x_j</math>: the <math>j^{th}</math> data point in <math>C_i</math></li>
<li><math>\mu_i</math>: the centroid of <math>C_i</math></li>
<li><math>||x_j - \mu_i||^2</math>: the squared Euclidean distance between <math>x_j</math> and <math>\mu_i</math></li>
</ul>
<br>

Since the goal for this paper is to classify research papers and group papers with similar topics based on keywords, the paper uses the K-means clustering algorithm. The algorithm first computes the cluster centroid for each group of papers with a specific topic. Then, it will assign a paper into a cluster based on the Euclidean distance between the cluster centroid and the paper’s TF-IDF value.
<br>
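
The assignment and centroid-update steps can be sketched in plain Python. This is a generic version of Lloyd's algorithm on made-up two-dimensional points, not the Scikit-learn implementation the authors used; in the paper's setting each point would be a paper's feature vector.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            [sum(coord) / len(cluster) for coord in zip(*cluster)] if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
centroids, clusters = k_means(points, centroids=[[0.0, 0.0], [1.0, 1.0]])
print(centroids)
print(clusters)
```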

However, different values of <math>k</math> (the number of clusters) will return different clustering results. Therefore, it is important to define the number of clusters before clustering. In this paper, the authors use the Elbow scheme to determine the value of <math>k</math>. The Elbow scheme is a somewhat subjective way of choosing an optimal <math>k</math>: it plots the average squared distance of the points from their respective cluster centers (the distortion) as a function of <math>k</math> and chooses the <math>k</math> at which the decrease in distortion is outweighed by the increase in complexity. Also, to measure the performance of clustering, the authors use the Silhouette scheme. The results of clustering are considered valid if the Silhouette scheme returns a value greater than <math>0.5</math>.
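
The Silhouette value can be computed directly from its usual definition. The sketch below uses the standard formula <math>s=(b-a)/\max(a,b)</math> averaged over all points, with Euclidean distance and a made-up clustering; it assumes at least two clusters with at least two points each and is not the authors' code.

```python
import math

def silhouette_score(clusters):
    """Mean silhouette value s = (b - a) / max(a, b) over all points.

    a: mean distance to the other points of the same cluster;
    b: smallest mean distance to the points of any other cluster.
    Assumes at least two clusters, each with at least two points.
    """
    scores = []
    for i, cluster in enumerate(clusters):
        for p in cluster:
            a = sum(math.dist(p, q) for q in cluster if q is not p) / (len(cluster) - 1)
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters) if j != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

well_separated = [[[0.0, 0.1], [0.1, 0.0]], [[0.9, 1.0], [1.0, 0.9]]]
score = silhouette_score(well_separated)
print(score)  # close to 1 for tight, well-separated clusters
```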

=System Testing Results=

In this paper, the dataset has 3264 research papers from the Future Generation Computer Systems (FGCS) journal between 1984 and 2017. To construct keyword dictionaries for each paper, the authors introduce three methods, as shown below:

<div align="center">[[File:table_3_tmtckd.JPG|700px]]</div>

Then, the authors use the Elbow scheme to define the number of clusters for each method with different numbers of keywords before running the K-means clustering algorithm. The results are shown below:

<div align="center">[[File:table_4_nocobes.JPG|700px]]</div>

According to Table 4, there is a positive correlation between the number of keywords and the number of clusters. In addition, method 3 combines the advantages of both method 1 and method 2; thus, method 3 requires the fewest clusters in total. On the other hand, wrong keywords might be present in papers; hence, it might not be possible to group papers with similar subjects correctly using method 1, and so method 1 needs the most clusters in total.

Next, the Silhouette scheme is used to measure the clustering performance. The average Silhouette values for each method with different numbers of keywords are shown below:

<div align="center">[[File:table_5_asv.JPG|700px]]</div>

Since the clustering is validated if the Silhouette’s value is greater than 0.5, for methods with 10 and 30 keywords, the K-means clustering algorithm produces good results.

To evaluate the accuracy of the classification system in this paper, the authors use the F-Score. The authors run the experiment 5 times, using 500 randomly selected research papers for each trial. The following histogram shows the average F-Score for the three methods and different numbers of keywords:

<div align="center">[[File:fig_16_fsvotm.JPG|700px]]</div>

Note that “TFIDF” means method 1, “LDA” means method 2, and “TFIDF-LDA” means method 3. The number 10, 20, or 30 after each method is the number of keywords the method has used.
According to the histogram above, method 3 has higher F-Score values than the other two methods for every number of keywords. Therefore, the classification system is most accurate when using method 3, as it combines the advantages of both method 1 and method 2.
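
For reference, the F-Score combines precision and recall. The sketch below assumes the balanced variant (the harmonic mean of precision and recall), since this summary does not state which variant the authors used; the counts are made up.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-Score: harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

print(f_score(true_positives=8, false_positives=2, false_negatives=2))
```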

=Conclusion=

This paper introduces a classification system that classifies papers into different topics by using the TF-IDF and LDA schemes with the K-means clustering algorithm. This system allows users to find the papers they want quickly and efficiently.

Furthermore, this classification system might also be used on other types of text (e.g. documents, tweets, etc.) instead of only classifying research papers.

=Critique=

In this paper, DF values are calculated within each partition. As a result, the DF value for a given word varies between partitions and may be inconsistent across different partitioning schemes. As mentioned above, a divide by zero problem can arise because some partitions have no documents containing a given word, but this can be solved by introducing a dummy document, as the authors did. Another method that might better solve both the inconsistency and the divide by zero problem is to have all partitions communicate their DF values, and then pass the merged DF value back to all partitions to compute the final IDF and TF-IDF values. This guarantees a consistent DF value across all partitions and avoids a divide by zero problem, as words in the keyword dictionary must appear in some document in the whole collection.

This paper treats words in different parts of a document equivalently; it might perform better if it gave different weights to the same word in different parts. For example, if a word appears in the title of the document, it usually indicates a main topic of the document, so more weight can be put on it during categorization.

When discussing the potential processing advantages of this classification system for other types of text samples, has the effect of processing mixed samples (text and image, or text and video) been taken into consideration? If not, in terms of text classification only, does it have an overwhelming advantage over traditional classification models?

=References=

Blei DM, et al. (2003). Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

Gil, JM, Kim, SW. (2019). Research paper classification systems based on TF-IDF and LDA schemes. ''Human-centric Computing and Information Sciences'', 9, 30. https://doi.org/10.1186/s13673-019-0192-7

Liu, S. (2019, January 11). Dirichlet distribution Motivating LDA. Retrieved November 2020, from https://towardsdatascience.com/dirichlet-distribution-a82ab942a879

Serrano, L. (Director). (2020, March 18). Latent Dirichlet Allocation (Part 1 of 2) [Video file]. Retrieved 2020, from https://www.youtube.com/watch?v=T05t-SqKArY

Research Papers Classification System

2020-12-01T03:22:10Z

Y87yu: /* References */

= Presented by =
Jill Wang, Junyi (Jay) Yang, Yu Min (Chris) Wu, Chun Kit (Calvin) Li

= Introduction =
This paper introduces a paper classification system that utilizes Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and K-means clustering. The key technology the system uses to process big data is the Hadoop Distributed File System (HDFS). The system can handle quantitatively complex research paper classification problems efficiently and accurately.

===General Framework===

The paper classification system classifies research papers based on the abstracts given that the core of most papers is presented in the abstracts.

<ol><li>Paper Crawling
<p>Collects abstracts from research papers published during a given period</p></li>
<li>Preprocessing
<p> <ol style="list-style-type:lower-alpha"><li>Removes stop words from the crawled papers and extracts only the nouns</li>
<li>Generates a keyword dictionary, keeping only the top-N keywords with the highest frequencies</li> </ol>
</p></li>
<li>Topic Modelling
<p> Uses LDA to group the keywords into topics</p>
</li>
<li>Paper Length Calculation
<p> Calculates the total number of occurrences of words in each abstract using the map-reduce algorithm, to prevent unbalanced TF values caused by the varying lengths of abstracts</p>
</li>
<li>Word Frequency Calculation
<p> Calculates the Term Frequency (TF) values which represent the frequency of keywords in a research paper</p>
</li>
<li>Document Frequency Calculation
<p> Calculates the Document Frequency (DF) values, which represent the frequency of keywords in a collection of research papers. The higher the DF value, the lower the importance of a keyword.</p>
</li>
<li>TF-IDF calculation
<p> Calculates the inverse of the DF, which represents the importance of a keyword, and multiplies it by the TF to obtain the TF-IDF value.</p>
</li>
<li>Paper Classification
<p> Classifies papers by topic using the K-means clustering algorithm.</p>
</li>
</ol>

===Technologies===

The massive paper data is processed on HDFS with a Hadoop cluster composed of one master node, one sub node, and four data nodes. The TF-IDF calculation is performed in Java with Hadoop 2.6.5, LDA is performed with Spark MLlib, and K-means clustering is performed with the Scikit-learn library.

===HDFS===

The Hadoop Distributed File System is used to process big data in this system. Hadoop breaks a big collection of data into partitions and passes each partition to an individual processor. Each processor only has information about the partition of data it receives.

'''In this summary, we focus on introducing the main algorithms this system uses, namely LDA, TF-IDF, and K-Means.'''

=Data Preprocessing=
===Crawling of Abstract Data===

Under the assumption that audiences tend to first read the abstract of a paper to gain an overall understanding of the material, it is reasonable to assume the abstract section includes “core words” that can be used to effectively classify a paper's subject.

An abstract is crawled and its stop words are removed. Stop words are words that are usually ignored by search engines, such as “the” and “a”. Afterwards, nouns are extracted as a more condensed representation for efficient analysis.

This is managed on HDFS. The TF-IDF value of each paper is calculated through map-reduce.

===Managing Paper Data===

To construct an effective keyword dictionary using the abstract data and keyword data in all of the crawled papers, the authors categorized keywords with similar meanings under a single representative keyword. This approach is called stemming, which is common in data cleaning. It yields 1394 keyword categories, which is still too many to compute with. Hence, only the top 30 keyword categories are used.

<div align="center">[[File:table_1_kswf.JPG|700px]]</div>

=Topic Modeling Using LDA=

Latent Dirichlet allocation (LDA) is a generative probabilistic model that views documents as random mixtures over latent topics. Each topic is a distribution over words, and the goal is to extract these topics from documents.

LDA estimates the topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> using Dirichlet priors for the distributions with a fixed number of topics. For each document, obtain a feature vector:

\[F = \left( P\left(z_1 | d\right), P\left(z_2 | d\right), \cdots, P\left(z_k | d\right) \right)\]

In the paper, the authors extract topics from the preprocessed papers to generate three topic sets with 10, 20, and 30 topics respectively. The following table shows the highest-frequency keywords for the set of 10 topics.

<div align="center">[[File:table_2_tswtebls.JPG|700px]]</div>

===LDA Intuition===

LDA places Dirichlet priors on its distributions. The following picture illustrates Dirichlet distributions over a 2-simplex with different alpha values, one for each corner of the triangles.

<div align="center">[[File:dirichlet_dist.png|700px]]</div>

A simplex is a generalization of the notion of a triangle. In a Dirichlet distribution, each parameter is represented by a corner of the simplex, so adding parameters increases the dimension of the simplex. As illustrated, when the alphas are smaller than 1 the distribution is dense at the corners. When the alphas are greater than 1 the distribution is dense at the centre.
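The effect of the alpha values can be checked numerically. The following is a minimal sketch, not taken from the paper, that samples from a symmetric 3-parameter Dirichlet distribution by normalizing Gamma draws and reports the average size of the largest coordinate. The function names, the sample size, and the alpha values 0.2 and 5.0 are illustrative assumptions.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one point on the simplex by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_component(alpha, n_samples=2000, seed=0):
    """Average of the largest coordinate; values near 1 mean samples sit near a corner."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += max(sample_dirichlet([alpha] * 3, rng))
    return total / n_samples


corner_heavy = mean_largest_component(0.2)  # alphas smaller than 1
centre_heavy = mean_largest_component(5.0)  # alphas greater than 1
print(corner_heavy, centre_heavy)
```

With alphas below 1 the largest coordinate tends to be close to 1 (samples near a corner), while with alphas above 1 it stays closer to 1/3 (samples near the centre).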

The following illustration shows an example LDA with 3 topics, 4 words and 7 documents.

<div align="center">[[File:LDA_example.png|800px]]</div>

In the left diagram, there are three topics, hence it is a 2-simplex. In the right diagram, there are four words, hence it is a 3-simplex. LDA essentially adjusts the parameters of the Dirichlet distributions and multinomial distributions (represented by the points) such that, in the left diagram, all the yellow points representing documents and, in the right diagram, all the points representing topics are as close to a corner as possible. In other words, LDA finds topics for documents and also finds words for topics. At the end, the topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> are produced.

=Term Frequency Inverse Document Frequency (TF-IDF) Calculation=

TF-IDF is widely used in the fields of information retrieval and text mining to evaluate the importance of a set of words. It is a combination of term frequency (TF) and inverse document frequency (IDF). The idea behind this combination is that
<ul>
<li>it evaluates the importance of a word within a document, and</li>
<li>it evaluates the importance of the word among the collection of all documents.</li>
</ul>

The TF-IDF formula has the following form:

\[\text{TF-IDF}_{i,j} = TF_{i,j} \times IDF_{i}\]

where i stands for the <math>i^{th}</math> word and j stands for the <math>j^{th}</math> document.

===Term Frequency (TF)===

TF measures the proportion of a document that is made up of a given word. The higher the TF value, the more important the word is to the document.

In this paper, TF is only calculated for words in the keyword dictionary obtained. For a given keyword i, <math>TF_{i,j}</math> is the number of times word i appears in document j divided by the total number of words in document j.

The formula for TF has the following form:

\[TF_{i,j} = \frac{n_{i,j} }{\sum_k n_{k,j} }\]

where i stands for the <math>i^{th}</math> word, j stands for the <math>j^{th}</math> document, and <math>n_{i,j}</math> stands for the number of times word i appears in document j.

Note that the denominator is the total number of words remaining in document j after crawling.
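As a small worked example of the TF formula, the following sketch computes TF values for a toy document. The function name, the toy tokens, and the keyword list are illustrative assumptions and do not come from the paper.

```python
from collections import Counter


def term_frequencies(doc_tokens, keywords):
    """TF_{i,j}: occurrences of keyword i in document j divided by the document's word count."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    if total == 0:
        # An empty document has no meaningful frequencies.
        return {k: 0.0 for k in keywords}
    return {k: counts[k] / total for k in keywords}


# A toy abstract after stop-word removal and noun extraction (hypothetical data).
doc = ["cloud", "scheduling", "cloud", "network", "grid"]
tf = term_frequencies(doc, ["cloud", "grid", "security"])
print(tf)
```

Here "cloud" appears twice among five words, so its TF is 2/5, while a keyword that does not occur in the document has a TF of 0.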

===Document Frequency (DF)===

DF measures the proportion of documents in the entire collection that contain a given word. Thus, the higher the DF value is, the less important the word is. Since DF and the importance of the word have an inverse relation, we use IDF instead of DF.

<math>DF_{i}</math> is the number of documents in the collection with word i divided by the total number of documents in the collection. The formula for DF has the following form:

\[DF_{i} = \frac{|\{d_k \in D: n_{i,k} > 0\}|}{|D|}\]

where <math>n_{i,k}</math> is the number of times word i appears in document k, |D| is the total number of documents in the collection.

===Inverse Document Frequency (IDF)===

In this paper, IDF is calculated on a log scale, since the collection contains a large number of documents, i.e., |D| is large.

The formula for IDF has the following form:

\[IDF_{i} = \log\left(\frac{|D|}{|\{d_k \in D: n_{i,k} > 0\}|}\right)\]

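As mentioned before, the system runs on HDFS, so IDF is computed within each partition, and a partition may contain no document with a given word, which would make the denominator zero. To avoid this, the formula actually applied adds one to both counts:

\[IDF_{i} = \log\left(\frac{|D|+1}{|\{d_k \in D: n_{i,k} > 0\}|+1}\right)\]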
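The smoothed formula can be illustrated with a short sketch. It treats a small list of tokenized documents as one partition and shows that a word absent from the partition still gets a finite IDF. The function names and the toy partition are illustrative assumptions, and the natural logarithm is assumed because the base is not stated here.

```python
import math


def document_count(docs, word):
    """Number of documents in the partition that contain the word at least once."""
    return sum(1 for d in docs if word in d)


def smoothed_idf(docs, word):
    """IDF_i = log((|D| + 1) / (df_i + 1)), defined even when df_i is zero."""
    return math.log((len(docs) + 1) / (document_count(docs, word) + 1))


def tf_idf(doc, docs, word):
    """TF-IDF_{i,j} = TF_{i,j} * IDF_i for one document of the partition."""
    tf = doc.count(word) / len(doc)
    return tf * smoothed_idf(docs, word)


# One hypothetical partition holding three tokenized abstracts.
partition = [
    ["cloud", "grid", "cloud"],
    ["cloud", "network"],
    ["security", "network"],
]
print(smoothed_idf(partition, "cloud"))       # word in 2 of 3 documents
print(smoothed_idf(partition, "blockchain"))  # word in no document, still finite
print(tf_idf(partition[0], partition, "cloud"))
```

Without the added ones, the second call would divide by zero, which is the situation the dummy document is meant to prevent.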

=Paper Classification Using K-means Clustering=

The K-means clustering is an unsupervised classification algorithm that groups similar data into the same class. It is an efficient and simple method that can work with different types of data attributes and is able to handle noise and outliers.
<br>

Given a <math>d</math> by <math>n</math> dataset <math>\mathbf{X} = \left[ \mathbf{x}_1 \cdots \mathbf{x}_n \right]</math>, the algorithm assigns each <math>\mathbf{x}_j</math> to one of <math>k</math> clusters based on the characteristics of <math>\mathbf{x}_j</math> itself.
<br>

Moreover, when assigning data into a cluster, the algorithm will also try to minimise the distances between the data and the centre of the cluster which the data belongs to. That is, k-means clustering will minimise the sum of square error:

\begin{align*}
\min \sum_{i=1}^{k} \sum_{j \in C_i} ||x_j - \mu_i||^2
\end{align*}

where
<ul>
<li><math>k</math>: the number of clusters</li>
<li><math>C_i</math>: the <math>i^{th}</math> cluster</li>
<li><math>x_j</math>: the <math>j^{th}</math> data point in <math>C_i</math></li>
<li><math>\mu_i</math>: the centroid of <math>C_i</math></li>
<li><math>||x_j - \mu_i||^2</math>: the squared Euclidean distance between <math>x_j</math> and <math>\mu_i</math></li>
</ul>
<br>

Since the goal of this paper is to classify research papers and group papers with similar topics based on keywords, the paper uses the K-means clustering algorithm. The algorithm first computes the cluster centroid for each group of papers with a specific topic. Then, it assigns each paper to a cluster based on the Euclidean distance between the cluster centroid and the paper’s TF-IDF value.
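The assign-then-update loop of K-means and the sum of squared errors it minimises can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the Scikit-learn implementation the authors used: the initial centroids are supplied by hand, and the toy 2-dimensional points stand in for TF-IDF vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members; an empty cluster keeps its centroid."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centroids.append(list(old))
            continue
        new_centroids.append(
            [sum(m[d] for m in members) / len(members) for d in range(len(old))]
        )
    return new_centroids


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid updates until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


def sum_squared_error(points, labels, centroids):
    """The objective: total squared distance from each point to its cluster centroid."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))


# Hypothetical 2-dimensional vectors and hand-picked starting centroids.
points = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
labels, centroids = k_means(points, centroids=[[0.0, 0.0], [1.0, 1.0]])
sse = sum_squared_error(points, labels, centroids)
print(labels, centroids, sse)
```

The result depends on the starting centroids, which is one reason practical implementations usually run several initialisations.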
<br>

However, different values of <math>k</math> (the number of clusters) will return different clustering results. Therefore, it is important to choose the number of clusters before clustering. In this paper, the authors use the Elbow scheme to determine the value of <math>k</math>. The Elbow scheme is a somewhat subjective way of choosing an optimal <math>k</math>: the average squared distance of points from their respective cluster centers (the distortion) is plotted as a function of <math>k</math>, and <math>k</math> is chosen at the point where the decrease in distortion is outweighed by the increase in complexity. To measure the performance of clustering, the authors use the Silhouette scheme. The results of clustering are considered valid if the Silhouette scheme returns a value greater than <math>0.5</math>.
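The Silhouette check can be sketched in a few lines. The code below assumes the standard silhouette coefficient, where for each point <math>a</math> is the mean distance to the other members of its own cluster, <math>b</math> is the smallest mean distance to any other cluster, and the score is <math>(b - a)/\max(a, b)</math>. The summary does not spell out the formula, so this definition, the function names, and the toy data are assumptions. At least two clusters are required.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points; labels must name at least two clusters."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        own = [
            euclidean(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not own:
            # A point alone in its cluster is given a score of 0 by convention.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(euclidean(p, q) for q, label in zip(points, labels) if label == c)
            / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical points with one sensible labelling and one poor labelling.
points = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
print(good, bad)
```

A labelling that keeps nearby points together scores well above the 0.5 threshold, while one that splits them scores much lower.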

=System Testing Results=

In this paper, the dataset has 3264 research papers from the Future Generation Computer Systems (FGCS) journal between 1984 and 2017. To construct keyword dictionaries for each paper, the authors introduce three methods, as shown below:

<div align="center">[[File:table_3_tmtckd.JPG|700px]]</div>

Then, the authors use the Elbow scheme to define the number of clusters for each method with different numbers of keywords before running the K-means clustering algorithm. The results are shown below:

<div align="center">[[File:table_4_nocobes.JPG|700px]]</div>

According to Table 4, there is a positive correlation between the number of keywords and the number of clusters. In addition, method 3 combines the advantages of both method 1 and method 2; thus, method 3 requires the fewest clusters in total. On the other hand, wrong keywords might be present in papers; hence, method 1 might not group papers with similar subjects correctly, and so it needs the most clusters in total.

Next, the Silhouette scheme was used to measure the performance of the clustering. The average Silhouette values for each method with different numbers of keywords are shown below:

<div align="center">[[File:table_5_asv.JPG|700px]]</div>

Since the clustering is validated if the Silhouette’s value is greater than 0.5, for methods with 10 and 30 keywords, the K-means clustering algorithm produces good results.

To evaluate the accuracy of the classification system in this paper, the authors use the F-Score. The authors run the experiment 5 times, using 500 randomly selected research papers for each trial. The following histogram shows the average value of the F-Score for the three methods and different numbers of keywords:

<div align="center">[[File:fig_16_fsvotm.JPG|700px]]</div>

Note that “TFIDF” means method 1, “LDA” means method 2, and “TFIDF-LDA” means method 3. The number 10, 20, or 30 after each method is the number of keywords the method used.
According to the histogram above, method 3 has higher F-Score values than the other two methods for every number of keywords. Therefore, the classification system is most accurate when using method 3, as it combines the advantages of both method 1 and method 2.
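For readers unfamiliar with the metric, the sketch below computes the balanced F-Score (F1) from precision and recall. The summary does not state which variant of the F-Score the authors used, so the F1 form, the function name, and the counts are assumptions made for illustration.

```python
def f_score(true_positive, false_positive, false_negative):
    """Balanced F-Score: the harmonic mean of precision and recall."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    if true_positive == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positive / predicted
    recall = true_positive / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one topic: 8 papers placed correctly,
# 2 placed in the topic by mistake, and 2 that were missed.
print(f_score(true_positive=8, false_positive=2, false_negative=2))
```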

=Conclusion=

This paper introduces a system that classifies papers into topics by combining the TF-IDF and LDA schemes with the K-means clustering algorithm. The system allows users to find the papers they want quickly and efficiently.

Furthermore, this classification system might also be applied to other types of text (e.g. documents, tweets, etc.) rather than only to research papers.

=Critique=

In this paper, DF values are calculated within each partition. As a result, the DF value for a given word varies from partition to partition, and the results may be inconsistent across different partitioning methods. As mentioned above, a divide-by-zero problem can arise because some partitions contain no documents with a given word, but this can be solved by introducing a dummy document, as the authors did. Another method that might better address both the inconsistency and the divide-by-zero problem is to have all partitions share their DF values, merge them, and pass the merged DF value back to every partition for the final IDF and TF-IDF calculation. Sharing DF values in this way guarantees a consistent DF value across all partitions and avoids the divide-by-zero problem, since every word in the keyword dictionary must appear in at least one document in the whole collection.

This paper treats words in different parts of a document equally; it might perform better if the same word were given different weights depending on where it appears. For example, a word that appears in the title usually indicates a main topic of the document, so it could be given more weight during classification.

=References=

Blei DM, et al. (2003). Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

Gil, JM, Kim, SW. (2019). Research paper classification systems based on TF-IDF and LDA schemes. ''Human-centric Computing and Information Sciences'', 9, 30. https://doi.org/10.1186/s13673-019-0192-7

Liu, S. (2019, January 11). Dirichlet distribution Motivating LDA. Retrieved November 2020, from https://towardsdatascience.com/dirichlet-distribution-a82ab942a879

Serrano, L. (Director). (2020, March 18). Latent Dirichlet Allocation (Part 1 of 2) [Video file]. Retrieved 2020, from https://www.youtube.com/watch?v=T05t-SqKArY

Improving neural networks by preventing co-adaption of feature detectors

2020-12-01T03:18:23Z

Y87yu:

== Presented by ==
Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim

= Introduction =
In this paper, Hinton et al. introduce a novel way to improve neural networks’ performance. By omitting each hidden neuron with a probability of 0.5 during training, each hidden unit is prevented from relying on other hidden units being present. Hence there are fewer co-adaptations among them on the training data. Called “dropout,” this process is also an efficient alternative to training many separate networks and averaging their predictions on the test set.
They used the standard stochastic gradient descent algorithm and separated the training data into mini-batches. An upper bound was set on the L2 norm of the incoming weight vector for each hidden neuron, and the vector was renormalized if its norm exceeded the bound. They found that using a constraint instead of a penalty forced the model to do a more thorough search of the weight-space when coupled with a very large learning rate that decays during training.
At test time, their dropout models included all of the hidden neurons, with outgoing weights halved to account for the chance of omission during training. The models were shown to result in lower test error rates on several datasets: MNIST; TIMIT; Reuters Corpus Volume; CIFAR-10; and ImageNet.

= MNIST =
The MNIST dataset contains 70,000 digit images of size 28 x 28. To see the impact of dropout, they used 4 different neural networks (784-800-800-10, 784-1200-1200-10, 784-2000-2000-10, 784-1200-1200-1200-10), all with the same dropout rates of 50% for hidden neurons and 20% for visible neurons. Stochastic gradient descent was used with mini-batches of size 100 and a cross-entropy loss function. Weights were updated after each minibatch, and training was done for 3000 epochs. An exponentially decaying learning rate <math>\epsilon</math> was used, with the initial value set to 10.0 and multiplied by the decay factor <math>f</math> = 0.998 at the end of each epoch. At each hidden layer, the length of the incoming weight vector for each hidden neuron was bounded above by <math>l</math>, and cross-validation showed that the results were best when <math>l</math> = 15. Initial weight values were drawn from a normal distribution with mean 0 and standard deviation 0.01. To update weights, an additional variable <math>p</math>, called momentum, was used to accelerate learning. The initial value of <math>p</math> was 0.5, and it increased linearly to the final value of 0.99 during the first 500 epochs, remaining unchanged afterwards. Also, when updating weights, the learning rate was multiplied by <math>1 - p</math>. <math>L</math> denotes the gradient of the loss function.

[[File:weights_mnist2.png|center|400px]]
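The schedules and the weight constraint described above can be written down compactly. The sketch below uses the values given in the text (initial learning rate 10.0, decay factor 0.998, momentum rising from 0.5 to 0.99 over 500 epochs, bound <math>l</math> = 15); the function names are hypothetical, and epochs are assumed to be counted from zero.

```python
import math


def learning_rate(epoch, initial=10.0, decay=0.998):
    """Learning rate after being multiplied by the decay factor at the end of each past epoch."""
    return initial * decay ** epoch


def momentum(epoch, start=0.5, end=0.99, ramp_epochs=500):
    """Momentum rises linearly from start to end over ramp_epochs, then stays constant."""
    if epoch >= ramp_epochs:
        return end
    return start + (end - start) * epoch / ramp_epochs


def constrain_incoming_weights(weights, max_norm=15.0):
    """Rescale a hidden unit's incoming weight vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights)
    return [w * max_norm / norm for w in weights]


print(learning_rate(0), learning_rate(1))
print(momentum(0), momentum(250), momentum(1000))
print(constrain_incoming_weights([30.0, 40.0]))  # norm 50 is rescaled to norm 15
```

The constraint leaves short weight vectors untouched and only rescales those that have grown past the bound, which is what distinguishes it from a penalty term.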

The best published result for a standard feedforward neural network was 160 errors. This was reduced to about 130 errors with 0.5 dropout and separate L2 constraints on the incoming weights of each hidden unit. By omitting a random 20% of the input pixels in addition to the aforementioned changes, the number of errors was further reduced to 110. The following figure visualizes the result.
[[File:mnist_figure.png|center|500px]]
A publicly available pre-trained deep belief net resulted in 118 errors, which was reduced to 92 errors when the model was fine-tuned with dropout. Another publicly available model was a deep Boltzmann machine, which resulted in 103, 97, 94, 93 and 88 errors when the model was unrolled and fine-tuned using standard backpropagation. These were reduced to 83, 79, 78, 78, and 77 errors when the model was fine-tuned with dropout – the mean of 79 errors was a record for models that do not use prior knowledge or enhanced training sets.

= TIMIT =

The TIMIT dataset includes voice samples of 630 American English speakers varying across 8 different dialects. It is often used to evaluate the performance of automatic speech recognition systems. Using Kaldi, the dataset was pre-processed to extract input features in the form of log filter bank responses.

=== Pre-training and Training ===

The neural network was pretrained as a deep belief network whose first layer was a Gaussian Restricted Boltzmann Machine (RBM). Visible biases were initialized to zero, and weights were sampled from a normal distribution <math>N(0, 0.01)</math>. Each visible neuron’s variance was set to 1.0 and remained unchanged.

Learning was done by minimizing Contrastive Divergence (CD). Momentum was used to speed up learning; it was initially set to 0.5 and increased linearly to 0.9 over 20 epochs. A learning rate of 0.001 was applied to the average gradient and then multiplied by <math>(1-momentum)</math>, and the L2 weight decay was set to 0.001. With these hyperparameters, the model was trained for 100 epochs. Binary RBMs were used for all subsequent layers, with a learning rate of 0.01. The visible bias of each neuron was initialized to <math>\log(p/(1 - p))</math>, where <math>p</math> is the mean activation of the neuron in the data set. Each of these layers was trained for 50 epochs, and all remaining hyper-parameters were the same as those for the Gaussian RBM.

=== Dropout tuning ===

The initial weights of the neural network were set from the pretrained RBMs. To finetune the network with dropout-backpropagation, momentum was initially set to 0.5 and increased linearly up to 0.9 over 10 epochs. A small constant learning rate of 1.0 was applied to the average gradient on a minibatch. All other hyperparameters were the same as for MNIST dropout finetuning. The model required approximately 200 epochs to converge. For comparison, they also finetuned the same network with standard backpropagation, using a learning rate of 0.1 and otherwise the same hyperparameters.

=== Classification Test and Performance ===

A neural network was constructed to measure the classification error rate on the test set of the TIMIT dataset. It has four fully-connected hidden layers with 4000 neurons per layer. The output layer has 185 softmax output neurons, which are merged into 39 distinct classes. The input to the network consists of 21 adjacent frames with an advance of 10ms per frame.

Comparing the performance of dropout with standard backpropagation on several network architectures and input representations, dropout consistently achieved lower error and cross-entropy. Results showed that it significantly controls overfitting, making the method robust to choices of network architecture. It also allowed much larger nets to be trained and removed the need for early stopping. Thus, neural network architectures with dropout are not very sensitive to the choice of learning rate and momentum.

= Reuters Corpus Volume =
Reuters Corpus Volume I archives 804,414 news documents that belong to 103 topics. Under four major themes (corporate/industrial, economics, government/social, and markets), they belonged to 63 classes. After removing 11 classes with no data and one class with insufficient data, 50 classes and 402,738 documents remained. The documents were divided equally and randomly into training and test sets, and each document was represented by the 2000 most frequent words in the dataset, excluding stopwords.

They trained two neural networks of size 2000-2000-1000-50, one using dropout and backpropagation, and the other using standard backpropagation. The training hyperparameters are the same as those for MNIST, but training was done for 500 epochs.

In the following figure, we see the significant improvement in test set error achieved by the model with dropout. On the right side, we see that learning with dropout also proceeds more smoothly.

[[File:reuters_figure.png|700px|center]]

= CNN =

Feed-forward neural networks consist of several layers of neurons, where each neuron in a layer applies a linear filter to its input and passes the result on to the neurons in the next layer. To compute a neuron’s output, a scalar bias is added to the filter output and a nonlinear activation function is applied; the filter weights and biases are the parameters of the network, which are learned from training data. [[File:cnnbigpicture.jpeg|thumb|upright=2|center|alt=text|Figure: Overview of Convolutional Neural Network]] There are several differences between convolutional neural networks (CNNs) and ordinary neural networks. The figure above gives a visual representation of a CNN. First, a CNN’s neurons are organized topographically into a bank laid out on a 2D grid, so it reflects the organization of the dimensions of the input data. Second, neurons in a CNN apply filters that are local and centered at the neuron’s location in the topographic organization. This reflects the idea that useful clues for identifying the object in an input image can be found by examining local neighborhoods of the image. Third, all neurons in a bank apply the same filter at different locations in the input image. In the image example, green is the input to one neuron bank, yellow is the filter bank, and pink is the output of one neuron bank (the convolved feature). A bank of neurons in a CNN thus applies a convolution operation to its input, and a single layer in a CNN typically has multiple banks of neurons, each performing a convolution with a different filter. The resulting neuron banks become distinct input channels into the next layer. The whole process reduces the net’s representational capacity, but also reduces its capacity to overfit.
[[File:bankofneurons.gif|thumb|upright=3|center|alt=text|Figure: Bank of neurons]]

=== Pooling ===

A pooling layer summarizes the activities of local patches of neurons in the convolutional layer by subsampling the output of the convolutional layer. Pooling is useful for extracting dominant features and, through dimensionality reduction, for decreasing the computational power required to process the data. The procedure is as follows: the output of a convolutional layer is divided into sections covered by pooling units, which are laid out topographically, each connected to a local neighborhood of outputs from the same convolutional bank. Each pooling unit then computes some function of its inputs, such as the maximum or the average. Maximum pooling returns the maximum value from the section of the image covered by the pooling unit, while average pooling returns the average of all the values inside the pooling unit (see example). As a result, there are fewer pooling units in total than convolutional unit outputs from the previous layer, because the pooling units are spaced further apart. Using the max-pooling function reduces the effect of outliers and improves generalization.
[[File:maxandavgpooling.jpeg|thumb|upright=2|center|alt=text|Figure: Max pooling and Average pooling]]
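The two pooling functions can be compared on a small feature map. The sketch below assumes non-overlapping square pooling units, which is a simplification chosen for illustration; the function names and the 4 x 4 feature map are hypothetical.

```python
def pool2d(feature_map, size, reducer):
    """Apply reducer to each non-overlapping size x size patch of a 2D feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            patch = [
                feature_map[r + dr][c + dc]
                for dr in range(size)
                for dc in range(size)
            ]
            pooled_row.append(reducer(patch))
        pooled.append(pooled_row)
    return pooled


def average(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 6],
]
max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, average)
print(max_pooled)  # [[6, 4], [7, 9]]
print(avg_pooled)  # [[3.75, 2.25], [4.0, 5.25]]
```

Both outputs have a quarter as many entries as the input, which is the dimensionality reduction described above.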

=== Local Response Normalization ===

This network includes local response normalization layers, which are implemented in lateral form, are used on neurons with unbounded activations, and permit the detection of high-frequency features with a big neuron response. This regularizer encourages competition among neurons belonging to different banks. Normalization is done by dividing the activity of a neuron in bank <math>i</math> at position <math>(x,y)</math> by the expression:
[[File:local response norm.png|upright=2|center|]] where the sum runs over <math>N</math> ‘adjacent’ banks of neurons at the same position in the topographic organization of the neuron bank. The constants <math>N</math>, <math>\alpha</math>, and <math>\beta</math> are hyper-parameters whose values are determined using a validation set. This technique has since been replaced by better techniques such as the combination of dropout and regularization methods (<math>L1</math> and <math>L2</math>).
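A rough sketch of this normalization is given below. Because the divisor appears only in the figure, the code assumes it has the form <math>\left(1 + \alpha \sum_j a_j^2\right)^{\beta}</math>, with the sum taken over the <math>N</math> banks centred on bank <math>i</math>; this form, the clipping of the window at the first and last bank, and the function name are all assumptions.

```python
def local_response_normalize(activities, i, n=9, alpha=0.001, beta=0.75):
    """Normalize the activity of bank i using squared activities of up to n adjacent banks.

    activities holds the activity of every bank at one fixed (x, y) position.
    The divisor (1 + alpha * sum of squares) ** beta is an assumed form.
    """
    half = n // 2
    lo = max(0, i - half)
    hi = min(len(activities) - 1, i + half)
    sum_squares = sum(a * a for a in activities[lo:hi + 1])
    return activities[i] / (1.0 + alpha * sum_squares) ** beta


# One strongly active bank surrounded by silent banks (hypothetical values).
print(local_response_normalize([0.0, 0.0, 2.0, 0.0, 0.0], i=2, n=3, alpha=1.0, beta=1.0))
```

Under this assumed form, a large activity in one bank shrinks the normalized output of its neighbours, which is how the layer encourages competition between banks.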

=== Neuron nonlinearities ===

All of the neurons in this model use the max-with-zero nonlinearity, where the output of a neuron is computed as <math> a^{i}_{x,y} = max(0, z^i_{x,y})</math> and <math> z^i_{x,y} </math> is the total input to the neuron. They use this nonlinearity because it has several advantages over traditional saturating neuron models, such as a significant reduction in the training time required to reach a certain error rate. Another advantage is that it reduces the need for contrast-normalization and data pre-processing, since the neurons do not saturate: their activities simply scale up when presented with unusually large input values. As the model’s only pre-processing step, they subtract the mean activity from each pixel, so that the data is centered.

=== Objective function ===

Their network is trained to maximize the multinomial logistic regression objective, which is the same as minimizing the average cross-entropy across training cases between the true label distribution and the model’s predicted label distribution.

=== Weight Initialization ===

It’s important to note that if a neuron always receives a negative input during training, it will not learn, because its output is uniformly zero under the max-with-zero nonlinearity. Hence, the weights in their model were sampled from a zero-mean normal distribution with a high enough variance. A high variance in the weights ensures that a certain number of neurons receive positive inputs so that learning can happen, and in practice it’s necessary to try out several candidate variances until a working initialization is found. In their experiments, setting the biases of the neurons in the hidden layers to a positive constant, such as 1, was helpful in finding one.

=== Training ===

The model is trained using stochastic gradient descent with a batch size of 128 samples and momentum of 0.9. The update rule for weight <math>w</math> is $$ v_{i+1} = 0.9v_i + \epsilon \left\langle \frac{dE}{dw_i} \right\rangle_i $$ $$w_{i+1} = w_i + v_{i+1} $$ where <math>i</math> is the iteration index, <math>v</math> is a momentum variable, <math>\epsilon</math> is the learning rate and <math>\left\langle \frac{dE}{dw_i} \right\rangle_i</math> is the average over the <math>i</math>th batch of the derivative of the objective with respect to <math>w_i</math>. Training on CIFAR-10 takes roughly 90 minutes, while training on ImageNet takes four days with dropout and two days without.
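The update rule can be tried on a one-dimensional toy problem. Because the objective is maximized, the velocity is added to the weight, as in the rule above. The quadratic objective, the learning rate of 0.01, the step count, and the function names below are illustrative assumptions.

```python
def momentum_step(w, v, avg_gradient, learning_rate, momentum=0.9):
    """One update: v_next = momentum * v + learning_rate * avg_gradient, w_next = w + v_next."""
    v_next = momentum * v + learning_rate * avg_gradient
    w_next = w + v_next
    return w_next, v_next


def objective_gradient(w):
    """Derivative of the toy objective E(w) = -(w - 3)^2, which is maximized at w = 3."""
    return -2.0 * (w - 3.0)


w, v = 0.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, objective_gradient(w), learning_rate=0.01)
print(w)  # approaches 3
```

In the actual network the gradient would be averaged over a batch of 128 samples rather than computed from a closed-form objective.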

=== Learning ===
To determine the learning rate for the network, they start with an equal learning rate for each layer, chosen as the largest power of ten that produces a reduction in the objective function. Usually, it is on the order of <math>10^{-2}</math> or <math>10^{-3}</math>. They then reduce the learning rate twice by a factor of ten before terminating training.

= CIFAR-10 =

=== CIFAR-10 Dataset ===

The CIFAR-10 dataset is a subset of the Tiny Images dataset with 10 classes, from which incorrect labels have been removed. It contains 5000 training images and 1000 testing images for each class. The dataset consists of 32 x 32 color images found by searching the web, and the images are labeled with the noun used to search for the image.

[[File:CIFAR-10.png|thumb|upright=2|center|alt=text|Figure 4: CIFAR-10 Sample Dataset]]

=== Models for CIFAR-10 ===

Two models, one with dropout and one without dropout, were built to test the performance of dropout on CIFAR-10. Both models are CNNs with three convolutional layers, each followed by a pooling layer. The pooling layer that follows the first convolutional layer performs max-pooling, and the remaining two pooling layers perform average-pooling. The first and second pooling layers are followed by response normalization layers with <math>N = 9</math>, <math>\alpha = 0.001</math>, and <math>\beta = 0.75</math>. A ten-unit softmax layer, which outputs a probability distribution over class labels, is connected to the upper-most pooling layer. All convolutional layers have 64 filter banks and use a filter size of 5×5.

Additional changes were made to the model with dropout. Because dropout imposes a strong regularization on the network, the model with dropout can use more parameters. Thus, a fourth weight layer is added that takes its input from the preceding pooling layer. This fourth weight layer is locally connected, but not convolutional, and contains 16 banks of filters of size 3 × 3 with 50% dropout. Lastly, the softmax layer takes its input from this fourth weight layer.

With a neural network with 3 convolutional hidden layers and 3 max-pooling layers, the authors achieved a classification error of 16.6%, beating the best published error rate without using transformed data, 18.5%. The model with one additional locally-connected layer and dropout at the last hidden layer produced an error rate of 15.6%.

= ImageNet =

===ImageNet Dataset===

ImageNet is a dataset of millions of high-resolution images labeled among 1000 different categories. The data were collected from the web and manually labeled using Amazon Mechanical Turk (MTurk), a crowd-sourcing tool provided by Amazon.
Because the ImageNet images may contain multiple objects and there are a large number of object classes, it is very difficult to achieve perfect accuracy on this dataset, even for humans. ImageNet and CIFAR-10 are similar in nature, but ImageNet is about 20 times larger (1,300,000 vs 60,000 images). ImageNet contains about 1.3 million training images, 50,000 validation images, and 150,000 testing images. The authors used images resized to 256 x 256 pixels for their experiments.

'''An ambiguous example to classify:'''

[[File:imagenet1.png|200px|center]]

When this paper was written, the best result on this dataset was an error rate of 45.7%, reported in "High-dimensional signature compression for large-scale image classification" (J. Sanchez, F. Perronnin, CVPR11 (2011)). The authors of this paper achieved a comparable error rate of 48.6% using a single neural network with five convolutional hidden layers with max-pooling layers in between, followed by two globally connected layers and a final 1000-way softmax layer. Applying 50% dropout to the 6th layer brought the error rate down to 42.4%.

'''ImageNet Dataset:'''

[[File:imagenet2.png|400px|center]]

===Models for ImageNet===

They mostly focused on the model with dropout, because the model without dropout followed a similar approach but suffered from serious overfitting. They used a convolutional neural network trained on 224×224 patches randomly extracted from the 256 × 256 images. As a form of data augmentation, this reduced the network’s capacity to overfit the training data and helped generalization. At test time, the predictions of the net were averaged over ten 224 × 224 patches of the 256 × 256 input image: the center patch, the four corner patches, and their horizontal reflections. This complicated network architecture was chosen to maximize performance on the validation set, and dropout was found to be very effective. It was also shown that using non-convolutional higher layers with a large number of parameters worked well with dropout but had a negative impact on performance without dropout.

The network contains seven weight layers. The first five are convolutional, and the last two are globally-connected. Max-pooling layers follow layers 1, 2, and 5. The output of the last globally-connected layer is fed to a 1000-way softmax output layer. Using this architecture, the authors achieved an error rate of 48.6%. Applying 50% dropout to the 6th layer brought the error rate down to 42.4%.
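
The ten-patch test procedure can be sketched as follows. This is an illustrative sketch only: the image is represented as a plain list of pixel rows, the function names are hypothetical, and the per-patch class probabilities passed to <code>average_predictions</code> are assumed to come from some trained network that is not shown here.

```python
def crop(image, top, left, size):
    """Return the size x size patch whose top-left pixel is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def hflip(patch):
    """Horizontal reflection: reverse every row of the patch."""
    return [row[::-1] for row in patch]


def ten_crops(image, size=224):
    """Four corner patches and the center patch, followed by their reflections."""
    height, width = len(image), len(image[0])
    offsets = [
        (0, 0),                                       # top-left corner
        (0, width - size),                            # top-right corner
        (height - size, 0),                           # bottom-left corner
        (height - size, width - size),                # bottom-right corner
        ((height - size) // 2, (width - size) // 2),  # center
    ]
    patches = [crop(image, top, left, size) for top, left in offsets]
    return patches + [hflip(p) for p in patches]


def average_predictions(per_patch_probs):
    """Average class-probability vectors (one per patch) element-wise."""
    n = len(per_patch_probs)
    return [sum(column) / n for column in zip(*per_patch_probs)]
```

For a 256 × 256 input, the center patch starts at offset (16, 16), since (256 - 224) / 2 = 16.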

[[File:modelh2.png|700px|center]]

[[File:layer2.png|600px|center]]

As with the previous datasets (MNIST, TIMIT, Reuters, and CIFAR-10), dropout gives a significant improvement on the ImageNet dataset. Even for complicated architectures like this one, introducing dropout helps the model generalize better and gives lower test error rates.

= Conclusion =

The authors have shown a consistent improvement by the models trained with dropout in classifying objects in the following datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.

= Critiques =
Dropping out half of the neurons to reduce co-adaptation is a brilliant idea. It is mentioned that for fully connected layers, dropout in all hidden layers works better than dropout in only one hidden layer. Another paper, Dropout: A Simple Way to Prevent Neural Networks from Overfitting [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf], gives a more detailed explanation.

It will be interesting to see how this paper could be used to prevent overfitting of LSTMs.

Firstly, classification using the "dropout" CNN method (omitting neurons in hidden layers) is a very interesting topic. If the authors could briefly explain, in theory, the advantages of this method for processing image data, it would be easier for readers to understand. A discussion of how to deal with the overfitting issue would also be valuable.

== Reference ==
[1] N. Srivastava et al., "Dropout: a simple way to prevent neural networks from overfitting", The Journal of Machine Learning Research, Jan 2014.

Improving neural networks by preventing co-adaption of feature detectors

2020-12-01T03:11:48Z

Y87yu: /* Critiques */

Improving neural networks by preventing co-adaption of feature detectors

2020-12-01T03:08:08Z

Y87yu: /* Critiques */

Streaming Bayesian Inference for Crowdsourced Classification

2020-12-01T03:02:21Z

Y87yu: /* Critique */

Group 4 Paper Presentation Summary

By Jonathan Chow, Nyle Dharani, Ildar Nasirov

== Motivation ==
Crowdsourcing can be a useful tool for data generation in classification projects by collecting annotations from large groups of people. Typically, this takes the form of online questions which many respondents manually answer for payment. One example of this is Amazon's Mechanical Turk, a website where businesses (or "requesters") remotely hire individuals (known as "Turkers" or "crowdsourcers") to perform human intelligence tasks (tasks that computers cannot do). In theory, it is effective in processing high volumes of small tasks that would be expensive to complete by other methods.

The primary limitation of this method of acquiring data is that respondents can submit incorrect responses, so data quality cannot be guaranteed.

Therefore, the success of crowdsourcing is limited by how well the ground truth can be determined. The primary method for doing so is probabilistic inference. However, current methods are computationally expensive, lack theoretical guarantees, or are limited to specific settings. Meanwhile, some approaches focus on how the data are sampled from the crowd, rather than on how they are aggregated, to improve the accuracy of the system.

== Dawid-Skene Model for Crowdsourcing ==
The one-coin Dawid-Skene model is popular for contextualizing crowdsourcing problems. For task <math>i</math> in set <math>M</math>, let the ground truth be the binary label <math>y_i \in \{\pm 1\}</math>. We collect labels <math>X = \{x_{ij}\}</math>, where <math>j \in N</math> is the index of the worker who provided the label.

We assume that we interact with workers in a sequential fashion. At each time step <math>t</math>, a worker <math>j = a(t) </math> provides the label <math>x_{ij} \in \{\pm 1\}</math> for an assigned task <math>i</math>. We denote responses up to time <math>t</math> via superscript.

We let <math>x_{ij} = 0</math> if worker <math>j</math> has not completed task <math>i</math>. We assume that <math>P(x_{ij} = y_i) = p_j</math>. This implies that each worker is independent and has equal probability of correctly labelling regardless of task. In crowdsourcing the data, we must determine how workers are assigned to tasks. We introduce two methods.

Under uniform sampling, workers are allocated to tasks such that each task is completed by the same number of workers, rounded to the nearest integer, and no worker completes a task more than once. This policy is given by <center><math>\pi_{uni}(t) = {\rm argmin}_{i \notin M_{a(t)}^t}\{ | N_i^t | \}.</math></center>

Under uncertainty sampling, we assign more workers to tasks that are less certain. Assuming we are able to estimate the posterior probability of the ground truth, we can allocate workers to the task whose predicted class has the lowest posterior probability. This policy is given by <center><math>\pi_{us}(t) = {\rm argmin}_{i \notin M_{a(t)}^t}\{ \max_{k \in \{\pm 1\}} P(y_i = k | X^t) \}.</math></center>
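
Both policies reduce to an argmin over the tasks that the current worker has not yet labelled. The sketch below assumes hypothetical data structures: <code>label_counts[i]</code> holds <math>|N_i^t|</math>, <code>posterior_pos[i]</code> holds an estimate of <math>P(y_i = +1 | X^t)</math>, and <code>done_by_worker</code> is the set <math>M_{a(t)}^t</math>. Ties are broken by the lowest task index, which is an arbitrary choice made here.

```python
def uniform_sampling(label_counts, done_by_worker):
    """Pick the unseen task that currently has the fewest labels."""
    candidates = [i for i in range(len(label_counts)) if i not in done_by_worker]
    # min() raises ValueError if the worker has already labelled every task.
    return min(candidates, key=lambda i: label_counts[i])


def uncertainty_sampling(posterior_pos, done_by_worker):
    """Pick the unseen task whose most likely class has the lowest probability."""
    candidates = [i for i in range(len(posterior_pos)) if i not in done_by_worker]
    return min(candidates, key=lambda i: max(posterior_pos[i], 1.0 - posterior_pos[i]))


if __name__ == "__main__":
    print(uniform_sampling([3, 1, 2], done_by_worker={1}))             # → 2
    print(uncertainty_sampling([0.9, 0.55, 0.3], done_by_worker={1}))  # → 2
```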

We then need to aggregate the data. The simple method of majority voting makes predictions for a given task based on the class the most workers have assigned it, <math>\hat{y}_i = \text{sign}\{\sum_{j \in N_i} x_{ij}\}</math>.
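
As a small illustration of majority voting, the sketch below stores the labels as a list of rows, one row per task, using the convention <math>x_{ij} = 0</math> for a missing label. The function name and the example matrix are made up for illustration; a tied vote yields 0, meaning no decision.

```python
def majority_vote(X):
    """Return sign(sum_j x_ij) for every task i; X[i][j] is -1, 0, or +1."""
    predictions = []
    for row in X:
        total = sum(row)
        predictions.append((total > 0) - (total < 0))  # sign of the vote total
    return predictions


if __name__ == "__main__":
    X = [
        [+1, +1, -1],  # two votes for +1, one for -1
        [0, -1, -1],   # first worker did not label this task
        [+1, -1, 0],   # tied vote
    ]
    print(majority_vote(X))  # → [1, -1, 0]
```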

== Streaming Bayesian Inference for Crowdsourced Classification (SBIC) ==
The aim of the SBIC algorithm is to estimate the posterior probability <math>P(y, p | X^t, \theta)</math>, where <math>X^t</math> are the observed responses at time <math>t</math> and <math>\theta</math> is our prior. We can then generate predictions <math>\hat{y}^t</math> as the marginal probability over each <math>y_i</math> given <math>X^t</math> and <math>\theta</math>.

We factor <math>P(y, p | X^t, \theta) \approx \prod_{i \in M} \mu_i^t (y_i) \prod_{j \in N} \nu_j^t (p_j) </math> where <math>\mu_i^t</math> corresponds to each task and <math>\nu_j^t</math> to each worker.

We then sequentially optimize the factors <math>\mu^t</math> and <math>\nu^t</math>. We begin by assuming that the worker accuracy follows a beta distribution with parameters <math>\alpha</math> and <math>\beta</math>. Initialize the task factors <math>\mu_i^0(+1) = q</math> and <math>\mu_i^0(-1) = 1 - q</math> for all <math>i</math>.

When a new label is observed at time <math>t</math>, we update the <math>\nu_j^t</math> of worker <math>j</math>. We then update <math>\mu_i</math>. These updates are given by

<center><math>\nu_j^t(p_j) \sim \text{Beta}(\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha, \sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(-x_{ij}) + \beta) </math></center>

<center><math>\mu_i^t(y_i) \propto \begin{cases} \mu_i^{t - 1}(y_i)\overline{p}_j^t & x_{ij} = y_i \\ \mu_i^{t - 1}(y_i)(1 - \overline{p}_j^t) & x_{ij} \ne y_i \end{cases}</math></center>
where <math>\overline{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta }</math>.

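For each task, we choose as our prediction the class <math>k \in \{-1,1\}</math> that maximizes <math>\mu_i^t(k) </math>.

Depending on the order in which the labels in <math>X</math> are processed, we obtain variants of the algorithm suited to different applications: Fast SBIC prioritizes speed, and Sorted SBIC prioritizes accuracy.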
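
A single update step can be written out directly from the two formulas above. This is a minimal sketch under stated assumptions, not the authors' implementation: each task factor is stored as the single number <math>\mu_i(+1)</math> in a dictionary, <code>past_labels</code> holds the labels that worker <math>j</math> gave before time <math>t</math>, and the function name and example values are hypothetical.

```python
def sbic_update(mu_pos, past_labels, task, label, alpha, beta):
    """Process one new label x_ij and return the worker's mean accuracy estimate.

    mu_pos[i] stores mu_i(+1); mu_i(-1) is 1 - mu_pos[i].
    past_labels maps each task the worker labelled earlier to that label.
    """
    # Expected number of earlier labels from this worker that were correct.
    agree = sum(mu_pos[i] if x == 1 else 1.0 - mu_pos[i]
                for i, x in past_labels.items())
    p_bar = (agree + alpha) / (len(past_labels) + alpha + beta)

    # Weight each class by p_bar if the label agrees with it, else by 1 - p_bar.
    pos = mu_pos[task] * (p_bar if label == 1 else 1.0 - p_bar)
    neg = (1.0 - mu_pos[task]) * (p_bar if label == -1 else 1.0 - p_bar)
    mu_pos[task] = pos / (pos + neg)  # normalise so the two classes sum to 1
    return p_bar


if __name__ == "__main__":
    mu = {0: 0.9, 1: 0.5}
    p = sbic_update(mu, past_labels={0: +1}, task=1, label=+1, alpha=2.0, beta=1.0)
    print(p, mu[1])  # both are approximately 0.725
```

In the example, the worker's earlier label on task 0 agrees with a class that has probability 0.9, so <math>\overline{p}_j^t = (0.9 + 2) / (1 + 2 + 1) = 0.725</math>; since task 1 started at the uninformative value 0.5, its updated <math>\mu_1(+1)</math> equals the same number.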

== Fast SBIC ==
The pseudocode for Fast SBIC is shown below.

<center>[[Image:FastSBIC.png|800px|]]</center>

As the name implies, the goal of this algorithm is speed. To facilitate this, we leave the order of <math>X</math> unchanged.

We express <math>\mu_i^t</math> in terms of its log-odds
<center><math>z_i^t = \log(\frac{\mu_i^t(+1)}{ \mu_i^t(-1)}) = z_i^{t - 1} + x_{ij} \log(\frac{\overline{p}_j^t}{1 - \overline{p}_j^t })</math></center>
where <math>z_i^0 = \log(\frac{q}{1 - q})</math>.

The product chain then becomes a summation and removes the need to normalize each <math>\mu_i^t</math>. We use these log-odds to compute worker accuracy,

<center><math>\overline{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} \sigma(x_{ij} z_i^{t-1}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta}</math></center>
where <math>\sigma(z_i^{t-1}) := \frac{1}{1 + \exp(-z_i^{t - 1})} = \mu_i^{t - 1}(+1) </math>.

The final predictions are made by choosing class <math>\hat{y}_i^T = \text{sign}(z_i^T) </math>. We see later that Fast SBIC has similar computational speed to majority voting.
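
The log-odds formulas above translate into a short loop. The following sketch evaluates the sum in the worker-accuracy formula directly and does not attempt to reproduce the pseudocode in the figure; the function names, the default values of <math>\alpha</math>, <math>\beta</math>, and <math>q</math>, and the tuple format of the label stream are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fast_sbic(labels, num_tasks, alpha=2.0, beta=1.0, q=0.5):
    """Process labels in arrival order and return sign(z_i) for every task.

    labels is an iterable of (task i, worker j, x_ij) with x_ij in {-1, +1}.
    """
    z = [math.log(q / (1.0 - q))] * num_tasks  # prior log-odds z_i^0
    history = {}                               # worker -> earlier (task, label) pairs
    for i, j, x in labels:
        past = history.setdefault(j, [])
        # sigma(x * z) is the current probability that an earlier label was correct.
        agree = sum(sigmoid(x_prev * z[i_prev]) for i_prev, x_prev in past)
        p_bar = (agree + alpha) / (len(past) + alpha + beta)
        z[i] += x * math.log(p_bar / (1.0 - p_bar))
        past.append((i, x))
    return [(zi > 0) - (zi < 0) for zi in z]


if __name__ == "__main__":
    stream = [(0, 0, +1), (0, 1, +1), (0, 2, -1), (1, 0, -1), (1, 1, -1)]
    print(fast_sbic(stream, num_tasks=2))  # → [1, -1]
```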

== Sorted SBIC ==
To increase the accuracy of the SBIC algorithm at the cost of computational efficiency, we run the algorithm in parallel, presenting the labels in different orders. The pseudocode for this algorithm is given below.

<center>[[Image:SortedSBIC.png|800px|]]</center>

From the general discussion of SBIC, we know that predictions on task <math>i</math> are more accurate toward the end of the collection process. This is a result of observing more data points and having run more updates on <math>\mu_i^t</math> and <math>\nu_j^t</math> to move them further from their prior. This means that task <math>i</math> is predicted more accurately when its corresponding labels are seen closer to the end of the process.

We take advantage of this property by maintaining a distinct “view” of the log-odds for each task. When a label is observed, we update views for all tasks except the one for which the label was observed. At the end of the collection process, we process skipped labels. When run online, this process must be repeated at every timestep.

We see that Sorted SBIC is slower than Fast SBIC by a factor of <math>M</math>, the number of tasks. However, we can reduce the complexity by viewing <math>s^k</math> across different tasks in an offline setting when the whole dataset is known in advance.

== Theoretical Analysis ==
The authors prove an exponential relationship between the error probability and the number of labels per task. This allows for an upper bound to be placed on the error, in the asymptotic regime, where it can be assumed the SBIC predictions are close to the true values. The two theorems, for the different sampling regimes, are presented below.

<center>[[Image:Theorem1.png|800px|]]</center>

<center>[[Image:Theorem2.png|800px|]]</center>

== Empirical Analysis ==
The purpose of the empirical analysis is to compare SBIC to the existing state-of-the-art algorithms. The SBIC algorithm is run on five real-world binary classification datasets. The results can be found in the table below. Other algorithms in the comparison are, from left to right, majority voting, expectation-maximization, mean-field, belief-propagation, Monte-Carlo sampling, and triangular estimation.

First of all, the algorithms are run on synthetic data that meet the assumptions of an underlying one-coin Dawid-Skene model, which allows the authors to compare SBIC's performance empirically with the theoretical results previously shown.

Second, the algorithms are run on real-world data. The performance of each algorithm is shown in the table below.

<center>[[Image:RealWorldResults.png|800px|]]</center>

In bold are the best performing algorithms for each dataset. We see that both versions of the SBIC algorithm are competitive, having similar prediction errors to EM, AMF, and MC. All are considered state-of-the-art Bayesian algorithms.

The figure below shows the average time required to simulate predictions on synthetic data under an uncertainty sampling policy. We see that Fast SBIC is comparable to majority voting and significantly faster than the other algorithms. This speed improvement, coupled with comparable accuracy, makes the Fast SBIC algorithm powerful.

<center>[[Image:TimeRequirement.png|800px|]]</center>

== Conclusion and Future Research ==
In conclusion, we have seen that SBIC is computationally efficient, accurate in practice, and has theoretical guarantees. The authors intend to extend the algorithm to the multi-class case in the future.

== Critique ==
In crowdsourcing data, the cost associated with collecting additional labels is not usually prohibitively expensive. As a result, if there is concern over ground-truth, paying for additional data to ensure X is sufficiently dense may be the desired response as opposed to sacrificing ground-truth accuracy. This may result in the SBIC algorithm being less practically useful than intended. Perhaps, a study can be done to compare the cost of acquiring additional data and checking how much it improves the accuracy of ground-truth. This can help us decide the usefulness of this algorithm. Second, SBIC should be used in multiple types of data like audio, text, and images to make sure that it delivers consistent results.

The paper tackles the classic problem of aggregating labels in a crowdsourced application, with a focus on speed. The algorithms proposed are fast and simple to implement and come with theoretical guarantees on the bounds for error rates. However, the paper starts with the objective of designing fast label aggregation algorithms for a streaming setting, yet it doesn’t spend any time motivating the applications in which such algorithms are needed. All the datasets used in the empirical analysis are static datasets; for the paper to be useful, the problem considered should be well motivated. It also appears that the output of the algorithm depends on the order in which the data are processed, which may need to be clarified. Finally, the theoretical results are presented under the assumption that the predictions of SBIC converge to the ground truth; however, the reasoning behind this assumption is not explained.

The paper assumes that crowdsourcing from human beings is systematic: that is, respondents to the problems would act in similar ways that can be classified into some categories. There are lots of other factors that need to be considered for a human respondent, such as fatigue effects and conflicts of interest. Those factors would seriously jeopardize the validity of the results and the model if they were not carefully designed for and taken care of. For example, if a formerly accurate worker performs badly one day and generates lots of faulty data, it would take many correct votes to even out the effect. Even in medical experiments that involve human subjects, with rigorous standards and procedures, the results can still be invalid. Trading the validity of results for speed is not wise.

When introducing the Streaming Bayesian Inference for Crowdsourced Classification method, it was explicitly mentioned that one can select for different applications based on the ordering of labels X. However, the "applications" mentioned here were not further explained or explored to support the effectiveness and meaning of developing such an algorithm. Thus, it would be helpful for the authors to build a connection between the proposed algorithm and its real-world applications to make this summary more purposeful and engaging.

In the Dawid-Skene Model for Crowdsourcing section, the summary only presents the simplest form of data aggregation, which is unusual in real-world use. Instead, we can use Bayesian methods which infer the value of the latent variables y and p by estimating their posterior probability P(y,p|X,θ) given the observed data X and prior θ.

Since the Dawid-Skene model is a mathematical model, the types of classification datasets (such as audio or text) for which it has potential advantages could be discussed in detail under "future research" to give readers a more intuitive understanding.

== References ==
[1] Manino, Tran-Thanh, and Jennings. Streaming Bayesian Inference for Crowdsourced Classification. 33rd Conference on Neural Information Processing Systems, 2019

Efficient kNN Classification with Different Numbers of Nearest Neighbors

2020-12-01T02:49:54Z

Y87yu: /* Critiques */

== Presented by ==
Cooper Brooke, Daniel Fagan, Maya Perelman

== Introduction ==
Traditional model-based classification approaches first use training observations to fit a model before predicting test samples. In contrast, the model-free k-nearest neighbors (kNN) method classifies observations with a majority-rules approach: it assigns each test sample to the class most common among its k closest training observations (neighbours). This method has become very popular due to its strong performance and simple implementation.

There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples. The second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.

== Previous Work ==

The problem of finding an optimal fixed k value for all test samples is well studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. Song et al. [2] explored selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] proposed selecting a robust k using the consensus of multiple rounds of kNNs. These fixed-k methods are valuable; however, they are impractical for data mining and machine learning applications.

Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahugara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross validation while Sahugara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches such as those taken by Zheng et al. and Góra and Wojna also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.

== Motivation ==

Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seeks to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.

A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.

== Approach ==

=== kTree Classification ===

The proposed kTree method is illustrated by the following flow chart:

[[File:Approach_Figure_1.png | center | 800x800px]]

==== Reconstruction ====

The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}</math> represents the training set can be written as:

$$\begin{aligned}
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2
\end{aligned}$$

In addition, an <math>l_1</math> regularization term multiplied by a tuning parameter, <math>\rho_1</math>, is added to ensure that sparse results are generated as the objective is to minimize the number of training samples that will eventually be depended on by the test samples.

The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. After some transformations, this second regularization term that has tuning parameter <math>\rho_2</math> is:

$$\begin{aligned}
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})
\end{aligned}$$

where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features.

This gives a final objective function of:

$$\begin{aligned}
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||_1 + \rho_2R(\textbf{W})
\end{aligned}$$

Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.
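
To make the three terms of the objective concrete, the sketch below evaluates it for given matrices. It is only an evaluation of the formula, not the iterative optimizer, and it rests on assumptions that the summary does not state: <math>\mathbf{X}</math> is stored with one column per training sample, <math>w_i</math> is the ith column of <math>\mathbf{W}</math>, <math>\mathbf{L}</math> is a square matrix over features, and the toy values are invented.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def reconstruction_objective(X, W, L, rho1, rho2):
    """sum_i ||X w_i - x_i||^2 + rho1 * ||W||_1 + rho2 * Tr(W^T X^T L X W)."""
    XW = matmul(X, W)  # column i of XW is the reconstruction X w_i
    fit = sum((XW[r][c] - X[r][c]) ** 2
              for r in range(len(X)) for c in range(len(X[0])))
    l1 = sum(abs(w) for row in W for w in row)
    M = matmul(transpose(XW), matmul(L, XW))
    smooth = sum(M[i][i] for i in range(len(M)))  # trace
    return fit + rho1 * l1 + rho2 * smooth


if __name__ == "__main__":
    X = [[1.0, 2.0], [3.0, 4.0]]    # 2 features x 2 samples (toy values)
    L = [[1.0, -1.0], [-1.0, 1.0]]  # Laplacian of a graph joining the 2 features
    W = [[1.0, 0.0], [0.0, 1.0]]    # every sample reconstructs itself exactly
    print(reconstruction_objective(X, W, L, rho1=0.5, rho2=0.25))  # → 3.0
```

In the example, the fit term is 0, the <math>l_1</math> term is 0.5 × 2 = 1, and the trace term is 0.25 × 8 = 2.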

==== Calculate ''k'' for training set ====

Each element <math>w_{ij}</math> in <math>\mathbf{W^*}</math> represents the correlation between the ith and jth training samples. If a value is 0, the jth training sample has no effect on the ith training sample and should not be used in its prediction. Consequently, only the training samples with non-zero entries in the weight vector of the ith training sample are useful in predicting it, so the number of these non-zero elements is the optimal ''k'' value for that sample.

For example, if there were four training samples and <math>\mathbf{W^*}</math> were a 4x4 matrix of the form:

[[File:Approach_Figure_2.png | center | 300x300px]]

The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 is non-zero.
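
Counting the non-zero weights is a one-line operation once <math>\mathbf{W^*}</math> is available. In the sketch below, the weight vector of sample i is assumed to be the ith column of the matrix (matching <math>Xw_i</math> in the objective), the matrix values are invented rather than taken from the figure, and a small tolerance treats numerically tiny weights as zero.

```python
def optimal_k(W, tol=1e-8):
    """Number of non-zero weights in each column of W, one count per sample."""
    n_rows, n_cols = len(W), len(W[0])
    return [sum(1 for r in range(n_rows) if abs(W[r][c]) > tol)
            for c in range(n_cols)]


if __name__ == "__main__":
    # Hypothetical 4x4 weight matrix; column 0 has non-zero weights for
    # training samples 2 and 4, so sample 1 gets k = 2.
    W = [
        [0.0, 0.3, 0.0, 0.2],
        [0.5, 0.0, 0.0, 0.1],
        [0.0, 0.7, 0.0, 0.0],
        [0.4, 0.0, 0.9, 0.0],
    ]
    print(optimal_k(W))  # → [2, 2, 1, 2]
```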

==== Train a Decision Tree using ''k'' as the label ====

In a normal decision tree, the target data is the labels themselves. In contrast, in the kTree method, the target data is the optimal ''k'' values for each sample that were solved for in the previous step. So this decision tree has the following form:

[[File:Approach_Figure_3.png | center | 300x300px]]

==== Making Predictions for Test Data ====

The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbours across '''all''' of the training data.
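
The final prediction step is an ordinary kNN vote in which ''k'' may differ from one test sample to the next. The sketch below covers only that step: the value of ''k'' is passed in directly and stands in for the output of the trained kTree, and the Euclidean distance, the function name, and the toy data are assumptions for illustration.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority class among the k training samples closest to x."""
    # Squared Euclidean distance gives the same ranking as the distance itself.
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, x)), idx)
        for idx, point in enumerate(train_X)
    )
    votes = Counter(train_y[idx] for _, idx in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]
    train_y = ["a", "a", "a", "b", "b"]
    # The same test point receives different labels under different k values.
    print(knn_predict(train_X, train_y, [4, 4], k=1))  # → b
    print(knn_predict(train_X, train_y, [4, 4], k=5))  # → a
```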

=== k*Tree Classification ===

The proposed k*Tree method is illustrated by the following flow chart:

[[File:Approach_Figure_4.png | center | 1000x1000px]]

This approach is very similar to the kTree method. The k*Tree method sacrifices very little predictive power in return for a substantial decrease in the complexity of running the traditional kNN on the testing data once the optimal ''k'' values have been found.

While all previous steps are exactly the same, the k*Tree method stores not only the optimal ''k'' value but also the following information:

* The training samples that have the same optimal ''k''
* The ''k'' nearest neighbours of the previously identified training samples
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours

The data stored in each node is summarized in the following figure:

[[File:Approach_Figure_5.png | center | 800x800px]]

In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.

== Experiments ==

In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, and contained a mix of data types varying in size, in dimensionality, in the number of classes, and in imbalanced nature of the data. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:

# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.
# kNN-Based Applicability Domain Approach (AD-kNN) [11]
# kNN Method Based on Sparse Learning (S-kNN) [10]
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]

The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features.

'''A. Experimental Results on Different Sample Sizes'''

The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.

[[File:Table_I_kNN.png | center | 1000x1000px]]

The following key results are noted:
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-kNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.
* The proposed k*Tree method had the lowest running cost of all methods. It was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but it ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and its running cost was lower than that of any other method except the k*Tree method.

'''B. Experimental Results on Different Feature Numbers'''

The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score [15] approach was used to rank and select the most informative features in the datasets.

[[File:Table_II_kNN.png | center | 1000x1000px]]

From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-kNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.

== Conclusion ==

This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods achieve efficient classification by designing a training step that reduces the run time of the test stage. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features.

== Critiques ==

*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy.
* The authors noted that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.).
*While the authors contrast their kTree and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compare to Góra and Wojna's varied optimal k method.

* The paper conducted experiments on kNN, AD-kNN, S-kNN, GS-kNN, FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.

* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.

* It would be really helpful if the kTree method were explained at the very beginning. The transition from kNN to kTree is not very smooth.

* It would be nice to have a comparison of the running costs of the different methods to see how much the kTree and k*Tree methods reduced the cost.

* It would be better to show only the key results in a summary rather than stacking up all results without screening.

== References ==

[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.

[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.

[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.

[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.

[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.

[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.

[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.

[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.

[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.

[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.

[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.

[12] Z. H. Zhou and Y. Yu, “Ensembling local learners through multimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.

[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.

[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.

[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.

Efficient kNN Classification with Different Numbers of Nearest Neighbors

2020-12-01T02:48:17Z

Y87yu: /* Experiments */

== Presented by ==
Cooper Brooke, Daniel Fagan, Maya Perelman

== Introduction ==
Traditional model-based classification approaches first use training observations to fit a model before predicting test samples. In contrast, the model-free k-nearest neighbours (kNN) method classifies observations by majority vote: each test sample is assigned to the most common class among its k closest training observations (neighbours). This method has become very popular due to its strong performance and simple implementation.
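The majority-vote rule described above can be sketched in a few lines. This is a minimal illustration only: the function name <code>knn_predict</code>, the Euclidean distance and the toy data are assumptions made for the example and are not taken from the paper.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k):
    """Return the majority label among the k training points closest to x."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    # Count the labels of the k closest points and take the most common one.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups.
train_X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
train_y = ["a", "a", "a", "b", "b"]
```

Here the same k is used for every query, which is the fixed-k setting discussed in the next paragraph.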

There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples. The second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.

== Previous Work ==

Finding an optimal fixed k value for all test samples has been well studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. Song et al. [2] explored selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] proposed selecting a robust k using the consensus of multiple rounds of kNNs. These fixed-k methods are valuable but impractical for data mining and machine learning applications.

Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. and Sahigara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross-validation, while Sahigara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches, such as those taken by Zhang et al. and Góra and Wojna, also show promise. Zhang et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.

== Motivation ==

Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seeks to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.

A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.

== Approach ==

=== kTree Classification ===

The proposed kTree method is illustrated by the following flow chart:

[[File:Approach_Figure_1.png | center | 800x800px]]

==== Reconstruction ====

The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}</math> represents the training set can be written as:

$$\begin{aligned}
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2
\end{aligned}$$

In addition, an <math>l_1</math> regularization term multiplied by a tuning parameter, <math>\rho_1</math>, is added to ensure that sparse results are generated as the objective is to minimize the number of training samples that will eventually be depended on by the test samples.

The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. After some transformations, this second regularization term that has tuning parameter <math>\rho_2</math> is:

$$\begin{aligned}
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})
\end{aligned}$$

where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features.

This gives a final objective function of:

$$\begin{aligned}
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||_1 + \rho_2R(\textbf{W})
\end{aligned}$$

Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.
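One possible iterative method is proximal gradient descent (ISTA), sketched below with NumPy. This is not necessarily the solver used in the paper. Several details are assumptions made for illustration: <math>\mathbf{X}</math> is stored as a features-by-samples array, the Laplacian is built from a Gaussian similarity between feature rows, and the diagonal of <math>\mathbf{W}</math> is forced to zero so that a sample cannot reconstruct itself. The function names are hypothetical.

```python
import numpy as np

def feature_laplacian(X):
    """Graph Laplacian L = D - S over the rows (features) of X (assumed construction)."""
    sq = np.sum(X ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    S = np.exp(-dist2 / (dist2.mean() + 1e-12))  # Gaussian similarity between features
    np.fill_diagonal(S, 0.0)
    return np.diag(S.sum(axis=1)) - S

def objective(X, W, L, rho1, rho2):
    """Least-squares fit + rho1 * l1 penalty + rho2 * Tr(W^T X^T L X W)."""
    fit = np.sum((X @ W - X) ** 2)
    return fit + rho1 * np.abs(W).sum() + rho2 * np.trace(W.T @ X.T @ L @ X @ W)

def reconstruct(X, L, rho1=0.1, rho2=0.1, n_iter=500):
    """Proximal gradient (ISTA); X is d x n, so column i of W reconstructs sample i."""
    n = X.shape[1]
    A = X.T @ X + rho2 * (X.T @ L @ X)            # Hessian of the smooth part is 2A
    step = 1.0 / (2.0 * np.linalg.eigvalsh(A)[-1] + 1e-12)
    W = np.zeros((n, n))
    for _ in range(n_iter):
        grad = 2.0 * (A @ W - X.T @ X)            # gradient of the smooth terms
        Z = W - step * grad
        W = np.sign(Z) * np.maximum(np.abs(Z) - step * rho1, 0.0)  # soft-threshold
        np.fill_diagonal(W, 0.0)  # assumption: no self-reconstruction
    return W
```

The soft-thresholding step is what produces exact zeros in <math>\mathbf{W}</math>; larger values of <math>\rho_1</math> give sparser solutions.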

==== Calculate ''k'' for training set ====

Each element <math>w_{ij}</math> in <math>\mathbf{W^*}</math> represents the correlation between the ith and jth training samples. If a value is 0, the jth training sample has no effect on the ith training sample and should not be used in predicting it. Consequently, only the non-zero entries of the weight vector for the ith training sample are useful in predicting it, so the number of non-zero entries for each sample is taken as the optimal ''k'' value for that sample.

For example, suppose there were 4 training samples and the 4x4 matrix <math>\mathbf{W^*}</math> had the form:

[[File:Approach_Figure_2.png | center | 300x300px]]

The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 is non-zero.
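The counting step can be sketched as follows. The matrix below is a hypothetical stand-in and is not the one shown in the figure; it is chosen so that sample 1 has non-zero weights on samples 2 and 4. The sketch assumes that column i holds the weights that reconstruct sample i, following the <math>Xw_i - x_i</math> notation, and uses a small tolerance in place of an exact zero test.

```python
import numpy as np

# Hypothetical 4x4 weight matrix; column i holds the weights that reconstruct sample i.
W_star = np.array([
    [0.0, 0.3, 0.0, 0.0],
    [0.6, 0.0, 0.2, 0.0],
    [0.0, 0.5, 0.0, 0.9],
    [0.4, 0.0, 0.7, 0.0],
])

def optimal_k(W, tol=1e-8):
    """Count the entries in each column whose magnitude exceeds tol."""
    return (np.abs(W) > tol).sum(axis=0)

k_values = optimal_k(W_star)  # one k per training sample
```

For this matrix the first three samples get ''k'' = 2 and the fourth gets ''k'' = 1.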

==== Train a Decision Tree using ''k'' as the label ====

In a normal decision tree, the target data is the labels themselves. In contrast, in the kTree method, the target data is the optimal ''k'' values for each sample that were solved for in the previous step. So this decision tree has the following form:

[[File:Approach_Figure_3.png | center | 300x300px]]

==== Making Predictions for Test Data ====

The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbours across '''all''' of the training data.
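A minimal sketch of this prediction step is given below. The trained kTree is abstracted as a callable <code>predict_k</code> that maps a test sample to a ''k'' value; that callable, the function name <code>ktree_predict</code> and the use of Euclidean distance are assumptions for illustration.

```python
from collections import Counter
import math

def ktree_predict(train_X, train_y, test_X, predict_k):
    """Label each test point by majority vote over its own k nearest training points.

    predict_k is a stand-in for the trained kTree: it maps a test point to a k value.
    """
    labels = []
    for x in test_X:
        k = max(1, min(int(predict_k(x)), len(train_X)))  # keep k in a valid range
        # All training samples are scanned here, as in the kTree method.
        order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
        votes = Counter(train_y[i] for i in order[:k])
        labels.append(votes.most_common(1)[0][0])
    return labels

# Toy usage with a hand-written rule standing in for the decision tree.
train_X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
train_y = ["a", "a", "a", "b", "b"]
toy_k = lambda x: 3 if x[0] < 2.5 else 1
predicted = ktree_predict(train_X, train_y, [(0.1, 0.2), (5.2, 5.1)], toy_k)
```

The only difference from fixed-k kNN is that ''k'' is looked up per test sample before the vote.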

=== k*Tree Classification ===

The proposed k*Tree method is illustrated by the following flow chart:

[[File:Approach_Figure_4.png | center | 1000x1000px]]

This approach is very similar to the kTree method. The k*Tree method sacrifices very little predictive power in return for a substantial decrease in the cost of running kNN on the testing data once the optimal ''k'' values have been found.

While all previous steps are exactly the same, the k*Tree method stores not only the optimal ''k'' value but also the following information:

* The training samples that have the same optimal ''k''
* The ''k'' nearest neighbours of the previously identified training samples
* The nearest neighbour of each of the previously identified ''k'' nearest neighbours

The data stored in each node is summarized in the following figure:

[[File:Approach_Figure_5.png | center | 800x800px]]

In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.
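The restricted search can be sketched as below, assuming a node simply stores a list of training-sample indices built from the three groups listed above. The helper names <code>nearest</code>, <code>build_leaf</code> and <code>leaf_predict</code> are hypothetical, and Euclidean distance is assumed.

```python
from collections import Counter
import math

def nearest(train_X, x, k, exclude=None):
    """Indices of the k training points closest to x, optionally skipping one index."""
    idx = [i for i in range(len(train_X)) if i != exclude]
    return sorted(idx, key=lambda i: math.dist(train_X[i], x))[:k]

def build_leaf(train_X, members, k):
    """Collect the sample indices a node would store for optimal value k."""
    # k nearest neighbours of every training sample assigned to this node.
    neighbours = {j for i in members for j in nearest(train_X, train_X[i], k, exclude=i)}
    # Nearest neighbour of each of those neighbours.
    second = {nearest(train_X, train_X[j], 1, exclude=j)[0] for j in neighbours}
    return sorted(set(members) | neighbours | second)

def leaf_predict(train_X, train_y, leaf, x, k):
    """kNN vote restricted to the samples stored in the node."""
    order = sorted(leaf, key=lambda i: math.dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in order).most_common(1)[0][0]

# Toy usage: sample 3 is the only training sample assigned to the node with k = 2.
train_X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["a", "a", "a", "b", "b", "b"]
leaf = build_leaf(train_X, members=[3], k=2)
label = leaf_predict(train_X, train_y, leaf, (5.5, 5.5), k=2)
```

Because the vote only looks at the indices in <code>leaf</code>, the cost of a prediction depends on the size of the node rather than on the size of the full training set.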

== Experiments ==

To assess the performance of the proposed methods against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, which varied in size, dimensionality, number of classes, and degree of class imbalance. Ten-fold cross-validation was used to measure classification accuracy, and the proposed methods were compared against the following:

# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.
# kNN-Based Applicability Domain Approach (AD-kNN) [11]
# kNN Method Based on Sparse Learning (S-kNN) [10]
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]

The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features.

'''A. Experimental Results on Different Sample Sizes'''

The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.

[[File:Table_I_kNN.png | center | 1000x1000px]]

The following key results are noted:
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-kNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.
* The proposed k*Tree method had the lowest running cost of all methods. It was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and its running cost was lower than that of any other method except k*Tree.

'''B. Experimental Results on Different Feature Numbers'''

The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score [15] approach was used to rank and select the most informative features in the datasets.

[[File:Table_II_kNN.png | center | 1000x1000px]]

From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-kNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. This is because k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.
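The feature ranking used in this experiment can be illustrated with the commonly used form of the Fisher score (between-class scatter divided by within-class scatter, computed per feature). The exact variant from [15] used by the authors may differ, and the function names below are hypothetical. Note that here rows are samples and columns are features.

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher score per feature; X is n_samples x n_features, y holds class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        # Class-size-weighted spread of class means around the overall mean.
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        # Class-size-weighted variance inside each class.
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)  # small constant avoids division by zero

def top_features(X, y, m):
    """Indices of the m highest-scoring features, best first."""
    return np.argsort(fisher_scores(X, y))[::-1][:m]

# Toy usage: feature 0 separates the two classes, feature 1 does not.
X_toy = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.0], [1.1, 2.0]]
y_toy = [0, 0, 1, 1]
best = top_features(X_toy, y_toy, 1)
```

Varying <code>m</code> reproduces the kind of "different numbers of features" setting evaluated in this section.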

== Conclusion ==

This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods achieve efficient classification by designing a training step that reduces the run time of the test stage. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features.

== Critiques ==

*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy.
* The authors noted that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.).
* While the authors contrast their kTree and k*Tree approaches with different kNN methods, the paper could contrast their results with more of the approaches discussed in its Related Work section. For example, it would be interesting to see how the kTree and k*Tree results compare to Góra and Wojna's varied optimal k method.

* The paper conducted experiments on kNN, AD-kNN, S-kNN, GS-kNN, FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.

* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.

* It would be really helpful if the kTree method were explained at the very beginning. The transition from kNN to kTree is not very smooth.

* It would be nice to have a comparison of the running costs of the different methods to see how much the kTree and k*Tree methods reduced the cost.

== References ==

[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.

[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.

[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.

[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.

[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.

[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.

[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.

[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.

[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.

[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.

[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.

[12] Z. H. Zhou and Y. Yu, “Ensembling local learners through multimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.

[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.

[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.

[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.

Efficient kNN Classification with Different Numbers of Nearest Neighbors

2020-12-01T02:48:02Z

Y87yu: /* Experiments */

== Presented by ==
Cooper Brooke, Daniel Fagan, Maya Perelman

== Introduction ==
Traditional model-based classification approaches first use training observations to fit a model before predicting test samples. In contrast, the model-free k-nearest neighbours (kNN) method classifies observations by majority vote: each test sample is assigned to the most common class among its k closest training observations (neighbours). This method has become very popular due to its strong performance and simple implementation.

There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples. The second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.

== Previous Work ==

Finding an optimal fixed k value for all test samples has been well studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. Song et al. [2] explored selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] proposed selecting a robust k using the consensus of multiple rounds of kNNs. These fixed-k methods are valuable but impractical for data mining and machine learning applications.

Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. and Sahigara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross-validation, while Sahigara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches, such as those taken by Zhang et al. and Góra and Wojna, also show promise. Zhang et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.

== Motivation ==

Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seeks to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.

A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.

== Approach ==

=== kTree Classification ===

The proposed kTree method is illustrated by the following flow chart:

[[File:Approach_Figure_1.png | center | 800x800px]]

==== Reconstruction ====

The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}</math> represents the training set can be written as:

$$\begin{aligned}
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2
\end{aligned}$$

In addition, an <math>l_1</math> regularization term multiplied by a tuning parameter, <math>\rho_1</math>, is added to ensure that sparse results are generated as the objective is to minimize the number of training samples that will eventually be depended on by the test samples.

The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. After some transformations, this second regularization term that has tuning parameter <math>\rho_2</math> is:

$$\begin{aligned}
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})
\end{aligned}$$

where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features.

This gives a final objective function of:

$$\begin{aligned}
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||_1 + \rho_2R(\textbf{W})
\end{aligned}$$

Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.

==== Calculate ''k'' for training set ====

Each element <math>w_{ij}</math> in <math>\mathbf{W^*}</math> represents the correlation between the ith and jth training samples. If a value is 0, the jth training sample has no effect on the ith training sample and should not be used in predicting it. Consequently, only the non-zero entries of the weight vector for the ith training sample are useful in predicting it, so the number of non-zero entries for each sample is taken as the optimal ''k'' value for that sample.

For example, suppose there were 4 training samples and the 4x4 matrix <math>\mathbf{W^*}</math> had the form:

[[File:Approach_Figure_2.png | center | 300x300px]]

The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 is non-zero.

==== Train a Decision Tree using ''k'' as the label ====

In a normal decision tree, the target data is the labels themselves. In contrast, in the kTree method, the target data is the optimal ''k'' values for each sample that were solved for in the previous step. So this decision tree has the following form:

[[File:Approach_Figure_3.png | center | 300x300px]]

==== Making Predictions for Test Data ====

The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbours across '''all''' of the training data.

=== k*Tree Classification ===

The proposed k*Tree method is illustrated by the following flow chart:

[[File:Approach_Figure_4.png | center | 1000x1000px]]

This approach is very similar to the kTree method. The k*Tree method sacrifices very little predictive power in return for a substantial decrease in the cost of running kNN on the testing data once the optimal ''k'' values have been found.

While all previous steps are exactly the same, the k*Tree method stores not only the optimal ''k'' value but also the following information:

* The training samples that have the same optimal ''k''
* The ''k'' nearest neighbours of the previously identified training samples
* The nearest neighbour of each of the previously identified ''k'' nearest neighbours

The data stored in each node is summarized in the following figure:

[[File:Approach_Figure_5.png | center | 800x800px]]

In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.

== Experiments ==

To assess the performance of the proposed methods against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, which varied in size, dimensionality, number of classes, and degree of class imbalance. Ten-fold cross-validation was used to measure classification accuracy, and the proposed methods were compared against the following:

# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.
# kNN-Based Applicability Domain Approach (AD-kNN) [11]
# kNN Method Based on Sparse Learning (S-kNN) [10]
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]

The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features.

'''A. Experimental Results on Different Sample Sizes'''

The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.

[[File:Table_I_kNN.png | center | 1000x1000px]]

The following key results are noted:
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-kNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.
* The proposed k*Tree method had the lowest running cost of all methods. It was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and its running cost was lower than that of any other method except k*Tree.

'''B. Experimental Results on Different Feature Numbers'''

The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score [15] approach was used to rank and select the most informative features in the datasets.

[[File:Table_II_kNN.png | center | 800x800px]]

From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-kNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. This is because k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.

== Conclusion ==

This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods achieve efficient classification by designing a training step that reduces the run time of the test stage. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features.

== Critiques ==

*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy.
* The authors noted that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.).
* While the authors contrast their kTree and k*Tree approaches with different kNN methods, the paper could contrast their results with more of the approaches discussed in its Related Work section. For example, it would be interesting to see how the kTree and k*Tree results compare to Góra and Wojna's varied optimal k method.

* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.

* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.

* It would be really helpful if Ktrees method can be explained at the very beginning. The transition from KNN to Ktrees are not very smooth.

* It would be nice to have comparison of the running costs of different methods to see how much cost the kTree and k*Tree reduced.

== References ==

[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.

[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.

[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.

[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.

[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.

[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.

[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.

[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.

[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.

[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.

[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.

[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.

[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.

[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.

[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.

Efficient kNN Classification with Different Numbers of Nearest Neighbors

2020-12-01T02:47:18Z

Y87yu: /* k*Tree Classification */

== Presented by ==
Cooper Brooke, Daniel Fagan, Maya Perelman

== Introduction ==
Traditional model-based classification approaches first use training observations to fit a model before predicting test samples. In contrast, the model-free k-nearest neighbours (kNN) method classifies observations with a majority-rule approach: kNN assigns each test sample to the class that is most common among its k closest training observations (neighbours). This method has become very popular due to its strong performance and simple implementation.

There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples. The second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.
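The fixed-k approach can be sketched in a few lines. This is a minimal illustration rather than code from the paper; the function name <code>knn_predict</code>, the Euclidean distance and the toy data are assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Classify one sample x by majority vote among its k nearest training samples.

    X_train has shape (n_samples, n_features); Euclidean distance is assumed.
    """
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two samples of class 0, three of class 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([0, 0, 1, 1, 1])
x_test = np.array([0.05, 0.1])

print(knn_predict(X_train, y_train, x_test, k=1))  # 0
print(knn_predict(X_train, y_train, x_test, k=5))  # 1
```

The same test sample receives a different label depending on k, which is why a single fixed k for all test samples can be a poor choice.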

== Previous Work ==

Finding an optimal fixed k value for all test samples is well studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. Song et al. [2] explored selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] selected a robust k using the consensus of multiple rounds of kNNs. These fixed-k methods are valuable; however, they are impractical for data mining and machine learning applications.

Efficient approaches to assigning varied k values have also been studied previously. Tuning approaches such as the ones taken by Zhu et al. and Sahigara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross-validation, while Sahigara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches, such as those taken by Zhang et al. and Góra and Wojna, also show promise. Zhang et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k values for each test sample. While all these methods are valid, they are time-consuming because they either learn varied k values or scan all training samples.

== Motivation ==

Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approaches seek to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.

A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.

== Approach ==

=== kTree Classification ===

The proposed kTree method is illustrated by the following flow chart:

[[File:Approach_Figure_1.png | center | 800x800px]]

==== Reconstruction ====

The first step is to use the training samples to reconstruct themselves. The goal is to find the matrix of correlations between the training samples, <math>\textbf{W}</math>, such that the distance between each training sample <math>x_i</math> and its reconstruction (the entire training set multiplied by the corresponding correlation vector <math>w_i</math>) is minimized. With <math>\mathbf{X}</math> representing the training set, this least-squares loss function can be written as:

$$\begin{aligned}
\min_{\textbf{W}} \sum_{i=1}^n ||\textbf{X}w_i - x_i||^2
\end{aligned}$$

In addition, an <math>l_1</math> regularization term, multiplied by a tuning parameter <math>\rho_1</math>, is added to make the solution sparse, since the objective is to minimize the number of training samples that each sample eventually depends on.

The least-squares loss function is then further modified so that samples with similar values for certain features yield similar results. After some transformations, this second regularization term, which has tuning parameter <math>\rho_2</math>, is:

$$\begin{aligned}
R(\textbf{W}) = \mathrm{Tr}(\textbf{W}^T \textbf{X}^T \textbf{L}\textbf{X}\textbf{W})
\end{aligned}$$

where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features.

This gives a final objective function of:

$$\begin{aligned}
\min_{\textbf{W}} \sum_{i=1}^n ||\textbf{X}w_i - x_i||^2 + \rho_1||\textbf{W}||_1 + \rho_2R(\textbf{W})
\end{aligned}$$

Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.
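One possible iterative method for this objective is proximal gradient descent (ISTA), sketched below under several assumptions that are not taken from the paper: <code>X</code> is stored with one training sample per column (shape features × samples, so that <math>\textbf{X}w_i</math> reconstructs <math>x_i</math>), the feature Laplacian is built from absolute feature correlations, and the diagonal of <math>\textbf{W}</math> is forced to zero so that a sample cannot trivially reconstruct itself. All function names are hypothetical.

```python
import numpy as np

def feature_laplacian(X):
    """Graph Laplacian over the features (rows of X).

    Using absolute correlations as edge weights is an assumption for illustration.
    """
    S = np.abs(np.corrcoef(X))
    np.fill_diagonal(S, 0.0)
    return np.diag(S.sum(axis=1)) - S

def objective(X, L, W, rho1, rho2):
    """Least-squares reconstruction loss + l1 penalty + Laplacian regularizer."""
    fit = np.sum((X @ W - X) ** 2)
    return fit + rho1 * np.abs(W).sum() + rho2 * np.trace(W.T @ X.T @ L @ X @ W)

def reconstruct(X, L, rho1=0.1, rho2=0.1, n_iter=500):
    """Proximal gradient descent (ISTA) on the objective; returns W of shape (n, n)."""
    n = X.shape[1]
    B = X.T @ X
    G = B + rho2 * (X.T @ L @ X)                       # half the Hessian of the smooth part
    step = 1.0 / (2.0 * np.linalg.eigvalsh(G).max())   # 1 / Lipschitz constant
    W = np.zeros((n, n))
    for _ in range(n_iter):
        grad = 2.0 * (G @ W - B)                       # gradient of the smooth part
        Z = W - step * grad
        W = np.sign(Z) * np.maximum(np.abs(Z) - step * rho1, 0.0)  # soft-thresholding
        np.fill_diagonal(W, 0.0)                       # assumption: no self-reconstruction
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))      # 3 features, 6 training samples (toy data)
L = feature_laplacian(X)
W = reconstruct(X, L)
```

The soft-thresholding step is what produces exact zeros in <math>\textbf{W}</math>; larger values of <code>rho1</code> give sparser solutions and therefore smaller k values in the next step.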

==== Calculate ''k'' for training set ====

Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training samples. If this value is 0, the jth training sample has no effect on the ith training sample and should not be used in the prediction of the ith training sample. Consequently, the non-zero entries of the vector <math>w_i</math> identify the training samples that are useful in predicting the ith training sample, so the number of non-zero entries in <math>w_i</math> is the optimal ''k'' value for that sample.

For example, suppose the training set has four samples and the 4x4 matrix <math>\textbf{W*}</math> has the form:

[[File:Approach_Figure_2.png | center | 300x300px]]

The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 is non-zero.
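Counting the non-zero weights is a one-line operation. The matrix below is a hypothetical example (it is not the matrix from the figure) in which column i holds the weights used to reconstruct sample i; the tolerance <code>tol</code> is an assumption to guard against numerically tiny values.

```python
import numpy as np

# Hypothetical 4x4 solution; column i holds the weights that reconstruct sample i.
W_star = np.array([
    [0.0, 0.3, 0.0, 0.1],
    [0.4, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.6],
    [0.5, 0.0, 0.7, 0.0],
])

def optimal_k(W, tol=1e-8):
    """Optimal k per training sample = number of non-zero weights in its column."""
    return np.count_nonzero(np.abs(W) > tol, axis=0)

k_train = optimal_k(W_star)
print(k_train)  # [2 2 1 2]
```

In this example the first sample depends on samples 2 and 4, so its optimal ''k'' is 2.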

==== Train a Decision Tree using ''k'' as the label ====

In a normal decision tree, the target data is the labels themselves. In contrast, in the kTree method, the target data is the optimal ''k'' values for each sample that were solved for in the previous step. So this decision tree has the following form:

[[File:Approach_Figure_3.png | center | 300x300px]]

==== Making Predictions for Test Data ====

The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbours across '''all''' of the training data.
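The last two steps can be sketched as follows, assuming scikit-learn's <code>DecisionTreeClassifier</code> as the decision tree (the paper's own tree implementation is not specified here) and data stored with one sample per row. The helper names and the toy ''k'' values are hypothetical.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def fit_ktree(X_train, k_train):
    """Fit a decision tree whose targets are the per-sample optimal k values."""
    return DecisionTreeClassifier(random_state=0).fit(X_train, k_train)

def ktree_predict(tree, X_train, y_train, X_test):
    """Predict k for each test sample, then run kNN over ALL training samples."""
    k_test = tree.predict(X_test)
    preds = []
    for x, k in zip(X_test, k_test):
        dists = np.linalg.norm(X_train - x, axis=1)   # scans the full training set
        nearest = np.argsort(dists)[: int(k)]
        preds.append(Counter(y_train[nearest].tolist()).most_common(1)[0][0])
    return np.array(preds), k_test

# Toy data: rows are samples; k_train would come from the reconstruction step.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [2.0, 2.0], [2.1, 1.8], [1.9, 2.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
k_train = np.array([1, 1, 1, 3, 3, 3])   # hypothetical optimal k values
X_test = np.array([[0.1, 0.1], [2.0, 2.1]])

tree = fit_ktree(X_train, k_train)
labels, k_test = ktree_predict(tree, X_train, y_train, X_test)
```

Note that the kNN search inside <code>ktree_predict</code> still scans every training sample, which is the cost the k*Tree variant aims to reduce.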

=== k*Tree Classification ===

The proposed k*Tree method is illustrated by the following flow chart:

[[File:Approach_Figure_4.png | center | 1000x1000px]]

This approach is very similar to the kTree. The k*Tree method aims to sacrifice very little predictive power in return for a substantial decrease in the cost of running kNN on the testing data once the optimal ''k'' values have been found.

While all previous steps are exactly the same, the k*Tree method stores not only the optimal ''k'' value but also the following information:

* The training samples that have the same optimal ''k''
* The ''k'' nearest neighbours of the previously identified training samples
* The nearest neighbour of each of the previously identified ''k'' nearest neighbours

The data stored in each node is summarized in the following figure:

[[File:Approach_Figure_5.png | center | 800x800px]]

In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.
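A minimal sketch of this idea follows. It assumes that each training sample has already been assigned to a leaf (<code>leaf_ids</code>, for instance obtained from the fitted tree) and that each leaf has one optimal ''k'' (<code>leaf_k</code>); both inputs, the toy data and all function names are hypothetical.

```python
import numpy as np
from collections import Counter

def nearest_indices(X, x, k, exclude=None):
    """Indices of the k rows of X closest to x, optionally skipping one row."""
    d = np.linalg.norm(X - x, axis=1)
    if exclude is not None:
        d[exclude] = np.inf
    return np.argsort(d)[:k]

def build_leaf_store(X_train, leaf_ids, leaf_k):
    """For every leaf keep: its own samples, their k nearest neighbours,
    and the nearest neighbour of each of those neighbours."""
    store = {}
    for leaf, k in leaf_k.items():
        members = np.flatnonzero(leaf_ids == leaf)
        keep = set(members.tolist())
        for i in members:
            nbrs = nearest_indices(X_train, X_train[i], k, exclude=i)
            keep.update(nbrs.tolist())
            for j in nbrs:
                keep.update(nearest_indices(X_train, X_train[j], 1, exclude=j).tolist())
        store[leaf] = np.array(sorted(keep))
    return store

def kstar_predict(x, leaf, leaf_k, store, X_train, y_train):
    """kNN restricted to the subset of training samples stored in the leaf."""
    subset = store[leaf]
    local = nearest_indices(X_train[subset], x, leaf_k[leaf])
    return Counter(y_train[subset[local]].tolist()).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.4],
                    [2.0, 2.0], [2.1, 1.8], [1.9, 2.3]])
y_train = np.array([0, 0, 0, 1, 1, 1])
leaf_ids = np.array([0, 0, 0, 1, 1, 1])   # hypothetical leaf assignment
leaf_k = {0: 1, 1: 3}                     # hypothetical optimal k per leaf

store = build_leaf_store(X_train, leaf_ids, leaf_k)
pred_a = kstar_predict(np.array([0.1, 0.1]), 0, leaf_k, store, X_train, y_train)
pred_b = kstar_predict(np.array([2.0, 2.1]), 1, leaf_k, store, X_train, y_train)
```

On realistic data the stored subsets are much smaller than the training set, so the test-stage search in <code>kstar_predict</code> touches only a fraction of the samples.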

== Experiments ==

In order to assess the performance of the proposed methods against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets from the UCI Repository of Machine Learning Data, which varied in size, dimensionality, number of classes, and degree of class imbalance. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:

# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.
# kNN-Based Applicability Domain Approach (AD-kNN) [11]
# kNN Method Based on Sparse Learning (S-kNN) [10]
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]

The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features.

'''A. Experimental Results on Different Sample Sizes'''

The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.

[[File:Table_I_kNN.png | center | 800x800px]]

The following key results are noted:
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-kNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.
* The proposed k*Tree method had the lowest running cost of all methods. Although it was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, it ran on average 15,000 times faster than either method. In addition, the kTree had the highest accuracy and its running cost was lower than that of any other method except the k*Tree method.

'''B. Experimental Results on Different Feature Numbers'''

The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score [15] approach was used to rank and select the most informative features in the datasets.

[[File:Table_II_kNN.png | center | 800x800px]]

From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-kNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The reason is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.
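The feature-ranking step used in this experiment can be illustrated with the common per-feature Fisher criterion, the ratio of between-class to within-class variance. This is an assumption: the exact Fisher score variant used in the paper is not stated in this summary, and the function name and toy data are hypothetical.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher score; X has shape (n_samples, n_features).

    A larger score means the feature separates the classes better.
    """
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)   # small constant avoids division by zero

# Toy data: feature 0 separates the classes, feature 1 does not.
X = np.array([[0.0, 5.0], [0.2, 1.0], [1.0, 4.0], [1.2, 2.0]])
y = np.array([0, 0, 1, 1])

scores = fisher_scores(X, y)
ranking = np.argsort(scores)[::-1]   # feature indices, best first
```

Keeping only the top-ranked features and rerunning each classifier gives one accuracy value per feature count, which is the kind of comparison reported in Table II.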

== Conclusion ==

This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods achieve efficient classification by designing a training step that reduces the run time of the test stage. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features.

== Critiques ==

* The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or accuracy on a holdout test set.
* The authors noted that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.).
* While the authors contrast their kTree and k*Tree approaches with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compare to Góra and Wojna's varied optimal k method.

* The paper conducted experiments on kNN, AD-kNN, S-kNN, GS-kNN, FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.

* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the k value for the kNN algorithm in clustering. However, that paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.

* It would be really helpful if the kTree method were explained at the very beginning. The transition from kNN to kTree is not very smooth.

* It would be nice to have a comparison of the running costs of the different methods to see how much cost kTree and k*Tree save.

== References ==

[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.

[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.

[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.

[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.

[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.

[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.

[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.

[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.

[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.

[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.

[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.

[12] Z. H. Zhou and Y. Yu, “Ensembling local learners through multimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.

[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.

[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.

[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.

User:Yktan

2020-12-01T02:43:39Z

Y87yu: /* References */

== Introduction ==

Much of the success in training deep neural networks (DNNs) is due to the collection of large datasets with human-annotated labels. However, human annotation is both a time-consuming and expensive task, especially for data that requires expertise such as medical data. Furthermore, certain datasets will be noisy due to the biases introduced by different annotators. Data obtained in large quantities through searching for images in search engines and data downloaded from social media sites (in a manner abiding by privacy and copyright laws) are especially noisy, since the labels are generally inferred from tags to save on human-annotation cost.

There are a few existing approaches for using datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach; a popular example is the bootstrapping loss. Other LNL methods estimate a noise transition matrix and employ it to correct the loss function. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples. The main limitation of these methods is that they do not perform well under high noise ratios and are prone to overfitting.

This paper introduces DivideMix, which combines approaches from LNL and SSL. One unique thing about DivideMix is that it discards sample labels that are highly likely to be noisy and leverages these noisy samples as unlabeled data instead. This prevents the model from overfitting and improves generalization performance. Key contributions of this work are:
1) Co-divide, which trains two networks simultaneously, aims to improve generalization and avoid confirmation bias.
2) During the SSL phase, an improvement is made on an existing method (MixMatch) by combining it with another method (MixUp).
3) Significant improvements to state-of-the-art results on multiple conditions are experimentally shown while using DivideMix. Extensive ablation study and qualitative results are also shown to examine the effect of different components.

== Motivation ==

While much has been achieved in training DNNs with noisy labels and SSL methods individually, not much progress has been made in exploring their underlying connections and building on top of the two approaches simultaneously.

Existing LNL methods aim to correct the loss function by:
<ol>
<li> Treating all samples equally and correcting loss explicitly or implicitly through relabelling of the noisy samples
<li> Reweighting training samples or separating clean and noisy samples, which results in correction of the loss function
</ol>

A few examples of LNL methods include:
<ol>
<li> Estimating the noise transition matrix, which denotes the probability of clean labels flipping to noisy labels, to correct the loss function
<li> Leveraging the predictions from DNNs to correct labels and using them to modify the loss
<li> Reweighting samples so that noisy labels contribute less to the loss
</ol>

However, these methods all have downsides: it is very challenging to correctly estimate the noise transition matrix in the first method; for the second method, DNNs tend to overfit to datasets with high noise ratio; and for the third method, we need to be able to identify clean samples, which has also proven to be challenging.

On the other hand, SSL methods mostly leverage unlabeled data using regularization to improve model performance. A recently proposed method, MixMatch, combines several kinds of regularization: consistency regularization, which encourages the model to produce consistent predictions on augmented input data; entropy minimization, which encourages the model to give high-confidence predictions on unlabeled data; and MixUp regularization.

DivideMix partially adopts LNL in that it removes the labels that are highly likely to be noisy by using co-divide to avoid the confirmation bias problem. It then utilizes the noisy samples as unlabeled data and adopts an improved version of MixMatch (an SSL technique) which accounts for the label noise during the label co-refinement and co-guessing phase. By incorporating SSL techniques into LNL and taking the best of both worlds, DivideMix aims to produce highly promising results in training DNNs by better addressing the confirmation bias problem, more accurately distinguishing and utilizing noisy samples, and performing well under high levels of noise.

== Model Architecture and Algorithm ==

DivideMix leverages semi-supervised learning to achieve effective modeling. The training set is first split into a labeled set and an unlabeled set. This is achieved by fitting a Gaussian Mixture Model (GMM) to the per-sample loss distribution. The unlabeled set is made up of data points whose labels are deemed noisy and discarded. Then, to avoid confirmation bias, which is typical when a model is self-training, two networks are trained simultaneously to filter errors for each other. This is done by dividing the data using one network and then training the other network on the divided data. This algorithm, known as co-divide, keeps the two networks from converging to each other during training, which prevents the bias from occurring. Staying diverged also gives the two networks distinct abilities to filter different types of error, making the model more robust to noise. Figure 1 describes the algorithm in graphical form.

[[File:ModelArchitecture.PNG | center]]

<div align="center">Figure 1: Model Architecture of DivideMix</div>
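The division step can be sketched with a small two-component, one-dimensional GMM fitted by expectation-maximization. This is an illustrative sketch, not the authors' implementation: the hand-written EM routine, the function names and the toy losses are assumptions, and the threshold <code>tau</code> plays the role of the clean-probability threshold τ mentioned in the experiments.

```python
import numpy as np

def _responsibilities(x, weights, means, variances):
    """Posterior probability of each component for every loss value."""
    dens = weights * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
    dens = dens / np.sqrt(2 * np.pi * variances)
    return dens / dens.sum(axis=1, keepdims=True)

def fit_gmm_1d(losses, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture to per-sample losses with EM."""
    x = np.asarray(losses, dtype=float)
    means = np.array([x.min(), x.max()])
    variances = np.full(2, x.var() + 1e-6)
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        resp = _responsibilities(x, weights, means, variances)              # E-step
        nk = resp.sum(axis=0)                                               # M-step
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances

def co_divide(losses, tau=0.5):
    """Return (mask, w): mask is True for samples kept in the labeled set,
    w is the posterior probability of the low-loss ("clean") component."""
    x = np.asarray(losses, dtype=float)
    weights, means, variances = fit_gmm_1d(x)
    w = _responsibilities(x, weights, means, variances)[:, np.argmin(means)]
    return w > tau, w

# Toy per-sample losses from one network (hypothetical values).
losses = [0.05, 0.10, 0.08, 0.12, 2.0, 2.3, 1.9, 0.07]
clean_mask, clean_prob = co_divide(losses)
```

In co-divide, the mask computed from one network's losses would select the labeled set used to train the other network, and vice versa.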

For each epoch, each network divides the dataset into a labeled set consisting of clean data and an unlabeled set consisting of noisy data, which is then used as training data for the other network, where training is done in mini-batches. For each batch of labeled samples, co-refinement combines the ground-truth label <math> y_b </math> with the predicted label <math> p_b </math>, using the posterior probability that the sample is clean as the weight <math> w_b </math>.

<center><math> \bar{y}_b = w_b y_b + (1-w_b) p_b </math></center>

Then, a sharpening function is applied to this weighted sum to reduce its temperature, producing the estimate <math> \hat{y}_b </math>.

<center><math> \hat{y}_b^c = Sharpen(\bar{y}_b,T)^c = \frac{(\bar{y}_b^c)^{1/T}}{\sum_{c'=1}^C (\bar{y}_b^{c'})^{1/T}} </math>, for <math>c = 1, 2,..,C</math></center>
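The co-refinement and sharpening formulas above translate directly into array operations. The sketch below assumes one-hot labels and probability vectors stored row-wise; the function names and example values are hypothetical.

```python
import numpy as np

def co_refine(y, p, w):
    """Weighted combination of given labels y and predictions p.

    y, p: arrays of shape (batch, classes); w: clean probabilities of shape (batch,).
    """
    w = w[:, None]
    return w * y + (1.0 - w) * p

def sharpen(y_bar, T=0.5):
    """Raise each class probability to the power 1/T and renormalize."""
    powered = y_bar ** (1.0 / T)
    return powered / powered.sum(axis=1, keepdims=True)

y = np.array([[1.0, 0.0, 0.0]])      # given (possibly noisy) one-hot label
p = np.array([[0.2, 0.7, 0.1]])      # network prediction
w = np.array([0.5])                  # probability that the label is clean

y_bar = co_refine(y, p, w)           # [[0.6, 0.35, 0.05]]
y_hat = sharpen(y_bar, T=0.5)
```

With T < 1 the largest entry of <code>y_bar</code> grows after sharpening while the result still sums to one, which is the intended lower-entropy target.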

For the unlabeled samples, the predictions of both networks are combined to assign a "co-guessed" label, which should be more accurate than the guess of a single network. Having calculated all these labels, MixMatch is applied to the combined mini-batch of labeled data, <math> \hat{X} </math>, and unlabeled data, <math> \hat{U} </math>: each pair of samples and their labels is mixed into one new sample and one new label. More specifically, for a pair of samples <math> (x_1,x_2) </math> and their labels <math> (p_1,p_2) </math>, the mixed sample <math> (x',p') </math> is:

<center>
<math>
\begin{alignat}{2}
\lambda &\sim \mathrm{Beta}(\alpha, \alpha) \\
\lambda ' &= \max(\lambda, 1 - \lambda) \\
x' &= \lambda ' x_1 + (1 - \lambda ' ) x_2 \\
p' &= \lambda ' p_1 + (1 - \lambda ' ) p_2
\end{alignat}
</math>
</center>
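The mixing rule above can be written as a short function. This is a sketch with hypothetical names; <code>alpha=4.0</code> mirrors the value of α quoted in the experiments, and the inputs are toy vectors rather than images.

```python
import numpy as np

def mix_pair(x1, p1, x2, p2, alpha=4.0, rng=None):
    """Mix two samples and their label distributions as in the equations above."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)            # keeps the result closer to (x1, p1)
    x_mixed = lam * x1 + (1.0 - lam) * x2
    p_mixed = lam * p1 + (1.0 - lam) * p2
    return x_mixed, p_mixed, lam

rng = np.random.default_rng(0)
x1, p1 = np.array([1.0, 1.0]), np.array([1.0, 0.0])
x2, p2 = np.array([0.0, 0.0]), np.array([0.0, 1.0])
x_mixed, p_mixed, lam = mix_pair(x1, p1, x2, p2, rng=rng)
```

Because <code>lam</code> is at least 0.5, the mixed sample always stays closer to its first input, so a mixed labeled sample remains dominated by labeled data.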

MixMatch transforms <math> \hat{X} </math> and <math> \hat{U} </math> into <math> X' </math> and <math> U' </math>. Then, the loss on <math> X' </math>, <math> L_X </math> (Cross-entropy loss) and the loss on <math> U' </math>, <math> L_U </math> (Mean Squared Error) are calculated. A regularization term, <math> L_{reg} </math>, is introduced to regularize the model's average output across all samples in the mini-batch. Then, the total loss is calculated as:

<center><math> L = L_X + \lambda_u L_U + \lambda_r L_{reg} </math></center>

where <math> \lambda_r </math> is set to 1, and <math> \lambda_u </math> is used to control the unsupervised loss.
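A sketch of how the three terms could be combined is given below. The exact form of <math> L_{reg} </math> is not spelled out in this summary; the code assumes a penalty that compares the mini-batch's average prediction with a uniform class prior, and the function name and toy inputs are hypothetical.

```python
import numpy as np

def total_loss(probs_x, targets_x, probs_u, targets_u, lambda_u, lambda_r=1.0):
    """L = L_X + lambda_u * L_U + lambda_r * L_reg on softmax outputs.

    probs_*: model probabilities, targets_*: (mixed) target distributions,
    all of shape (batch, classes).
    """
    eps = 1e-12
    # Cross-entropy on the mixed labeled batch.
    L_x = -np.mean(np.sum(targets_x * np.log(probs_x + eps), axis=1))
    # Mean squared error on the mixed unlabeled batch.
    L_u = np.mean(np.sum((probs_u - targets_u) ** 2, axis=1))
    # Assumed regularizer: divergence of a uniform prior from the average prediction.
    n_classes = probs_x.shape[1]
    prior = np.full(n_classes, 1.0 / n_classes)
    mean_pred = np.concatenate([probs_x, probs_u]).mean(axis=0)
    L_reg = np.sum(prior * np.log(prior / (mean_pred + eps)))
    return L_x + lambda_u * L_u + lambda_r * L_reg

eye = np.array([[1.0, 0.0], [0.0, 1.0]])
collapsed = np.array([[0.99, 0.01], [0.99, 0.01]])   # always predicts class 0

loss_good = total_loss(eye, eye, eye, eye, lambda_u=1.0)
loss_bad = total_loss(collapsed, eye, collapsed, eye, lambda_u=1.0)
```

A model that assigns every sample to one class is penalized by all three terms here, including the regularizer on the average output.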

Lastly, the parameters <math> \boldsymbol{ \theta } </math> are updated by stochastic gradient descent on the calculated loss, <math> L </math>.

The full algorithm is shown below.

[[File:dividemix.jpg | 600px | center]]
<div align="center">Algorithm 1: DivideMix. Lines 4-8: co-divide; Lines 17-18: label co-refinement; Line 20: co-guessing.</div>

During the warm-up, the model is trained on all data using standard cross-entropy loss so that it initially converges, with a negative entropy term <math>-\mathcal{H}</math> added as a regularizer, where <math>\mathcal{H} = -\sum_{c}\text{p}^\text{c}_\text{model}(x;\theta)\log(\text{p}^\text{c}_\text{model}(x;\theta))</math> and <math>\text{p}^\text{c}_\text{model}</math> is the softmax output probability for class c. This term penalizes confident predictions during the warm-up to prevent overfitting to noise, which can happen when there is asymmetric noise.
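The warm-up objective can be sketched as cross-entropy minus the prediction entropy. The function names and the toy logits are hypothetical, and the penalty is given unit weight as an assumption.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def warmup_loss(logits, labels, use_penalty=True):
    """Cross-entropy, optionally with the negative entropy term -H added."""
    eps = 1e-12
    p = softmax(logits)
    ce = -np.mean(np.log(p[np.arange(len(labels)), labels] + eps))
    entropy = -np.mean(np.sum(p * np.log(p + eps), axis=1))   # H
    return ce - entropy if use_penalty else ce

labels = np.array([0])
confident = np.array([[10.0, 0.0]])   # very peaked prediction
moderate = np.array([[1.0, 0.0]])     # softer prediction
```

With the penalty, the very confident prediction no longer achieves the lowest loss, even though plain cross-entropy alone would prefer it.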

== Results ==
'''Applications'''

The method was validated using four benchmark datasets: CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), which each contain 50K training images and 10K test images of size 32 × 32, Clothing1M (Xiao et al., 2015), and WebVision (Li et al., 2017a).
Two types of label noise are used in the experiments: symmetric and asymmetric.
An 18-layer PreAct ResNet (He et al., 2016) is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 300 epochs. The initial learning rate is set to 0.02 and reduced by a factor of 10 after 150 epochs. Before applying the co-divide and MixMatch strategies, the models are first trained independently over the entire dataset using cross-entropy loss during a "warm-up" period. Training the models in this way initially prepares a more regular distribution of losses to improve upon in subsequent epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100. All CIFAR experiments use the same hyperparameters M = 2, T = 0.5, and α = 4. τ is set to 0.5, except for the 90% noise ratio, where it is set to 0.6.

'''Comparison of State-of-the-Art Methods'''

The effectiveness of DivideMix was shown by comparing its test accuracy with the most recent state-of-the-art methods:
* Meta-Learning (Li et al., 2019) proposes a gradient-based method to find model parameters that are more noise-tolerant.
* Joint-Optim (Tanaka et al., 2018) and P-correction (Yi & Wu, 2019) jointly optimize the sample labels and the network parameters.
* M-correction (Arazo et al., 2019) models sample loss with BMM and applies MixUp.
The following are the results on CIFAR-10 and CIFAR-100 with different levels of symmetric label noise ranging from 20% to 90%. Both the best test accuracy across all epochs and the averaged test accuracy over the last 10 epochs were recorded in the following table:

[[File:divideMixtable1.PNG | center]]

From Table 1, the authors note that none of the existing methods consistently outperforms the others across different datasets. M-correction excels at symmetric noise, whereas Meta-Learning performs better for asymmetric noise. DivideMix outperforms state-of-the-art methods by a large margin across all noise ratios. The improvement is substantial (∼10% of accuracy) for the more challenging CIFAR-100 with high noise ratios.

DivideMix was also compared with the state-of-the-art methods on the other two datasets, Clothing1M and WebVision. The results show that DivideMix consistently outperforms state-of-the-art methods across all datasets with different types of label noise. For WebVision, DivideMix achieves more than 12% improvement in top-1 accuracy.

'''Ablation Study'''

The effect of removing different components was studied to provide insight into what makes DivideMix successful. The results in Table 5 are analyzed as follows.

[[File:DivideMixtable5.PNG | center]]

The authors combined self-divide with the original MixMatch as a naive baseline for using SSL in LNL.
They also find that both label refinement and input augmentation are beneficial for DivideMix. ''Label refinement'' is important at high noise ratios because more noisy samples would be incorrectly divided into the labeled set. ''Augmentation'' improves model performance by creating more reliable predictions and by achieving consistency regularization. In addition, the performance drop seen in ''DivideMix w/o co-training'' highlights the disadvantage of self-training; the model still has dataset division, label refinement and label guessing, but they are all performed by the same model.

== Conclusion ==

This paper provides a new and effective algorithm for learning with noisy labels by treating highly noisy samples as unlabelled data in a semi-supervised learning framework. The DivideMix method trains two networks simultaneously and uses label co-refinement and co-guessing effectively, making it a robust approach to dealing with noise in datasets. DivideMix has also been tested on various datasets through extensive experiments, with results consistently among the best when compared to state-of-the-art methods.

Future work on DivideMix includes adapting it to other applications such as Natural Language Processing, and further incorporating ideas from SSL and LNL into the DivideMix architecture.

== Critiques/ Insights ==

1. While combining both models improves the results, the authors did not report the relative increase in training time for the combined methodology, which is crucial when training on a large amount of data, especially images. In addition, it seems that the authors did not do much hyperparameter tuning for the combined model.

2. There is an interesting insight, which is when the noise ratio increases from 80% to 90%, the accuracy of DivideMix drops dramatically in both datasets.

3. There should be a further explanation of why the learning rate drops by a factor of 10 after 150 epochs.

4. It would be interesting to see the effectiveness of this method in other domains such as NLP. I am not aware of noisy training datasets available in NLP, but surely this is an important area to focus on, as much of the available data is collected from noisy sources from the web.

5. The paper implicitly assumes that a Gaussian mixture model (GMM) is sufficiently capable of identifying noise. Given the nature of a GMM, it would work well for noise that is distributed by a Gaussian distribution but for all other noise, it would probably be only asymptotic. The paper should present theoretical results on the noise that are Exponential, Rayleigh, etc. This is particularly important because the experiments were done on massive datasets, but they do not directly address the case when there are not many data points.

6. Comparing the training result on these benchmark datasets makes the algorithm quite comprehensive. This is a very insightful idea to maintain two networks to avoid bias from occurring.

7. The current benchmark accuracy for CIFAR-10 is 99.7 and for CIFAR-100 is 96.08, using EffNet-L2 in 2020. In 2019, the accuracy for CIFAR-10 was 99.37 and for CIFAR-100 was 93.51, using BiT-L (based on paperswithcode.com). As better methods exist, it would be nice to know why the authors chose these state-of-the-art methods to compare test accuracy.

8. Another interesting observation is that DivideMix seems to maintain a similar accuracy while some methods give unstable results. That shows the reliability of the proposed algorithm.

9. It would be interesting to see if the drop in accuracy from increasing the noise ratio to 90% is a result of a low proportion or a low number of clean labels. That is, would increasing the size of the training set but keeping the noise ratio at 90% result in increased accuracy?

10. For Ablation Study part, the paper also introduced a study on the Robustness of Testing Marking Methods Noise, including AUC for classification of clean/noisy samples of CIFAR-10 training data. And it shows that the method can effectively separate clean and noisy samples as training proceeds.

11. It is interesting how, unlike common methods, the method in this paper discards the labels that are highly likely to be noisy. It also utilizes the noisy samples as unlabeled data to regularize training in an SSL manner. This model can better distinguish and utilize noisy samples.

12. In the results section, the authors give a comprehensive understanding of this algorithm by introducing its applications and comparing it with similar methods. It would be helpful if, in the applications part, the authors could indicate how the method relates to our daily life.

13. High-quality data is very important for training machine learning systems. Preparing the data to train ML systems requires data annotations, which are prone to errors and time-consuming to produce. It is interesting to note how paper 14 and this paper approach this problem from different perspectives. Paper 14 introduces the CSL algorithm, which learns from confused or noisy data to find the tasks associated with them, while this paper proposes an algorithm that performs well when learning from noisy data. Hence both papers tackle a similar problem, and combining the approaches described in both when handling noisy data could be doubly helpful.

14. Noise exists in all big data, and big data is what we are dealing with in real life nowadays. Having an effective noise eliminating method such as Dividemix is important to us.

15. The DivideMix consistently outperforms state-of-the-art methods across the given datasets, but how about some other potential datasets? If it can be given that it has advantages for a certain type of potential dataset, it will be a better discussion.

== References ==
[1] Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Unsupervised
label noise modeling and loss correction. In ICML, pp. 312–321, 2019.

[2] David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin
Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019.

[3] Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach
to learning from noisy labels. In WACV, pp. 1215–1224, 2018.

User:Yktan

2020-12-01T02:43:22Z

Y87yu: /* Critiques/ Insights */

== Introduction ==

Much of the success in training deep neural networks (DNNs) is due to the collection of large datasets with human-annotated labels. However, human annotation is both a time-consuming and expensive task, especially for data that requires expertise such as medical data. Furthermore, certain datasets will be noisy due to the biases introduced by different annotators. Data obtained in large quantities through searching for images in search engines and data downloaded from social media sites (in a manner abiding by privacy and copyright laws) are especially noisy, since the labels are generally inferred from tags to save on human-annotation cost.

There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. Other LNL methods estimate a noise transition matrix and employ it to correct the loss function. An example of a popular loss correction approach is the bootstrapping loss approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples. The main limitation of these methods is that they do not perform well under high noise ratio and cause overfitting.

This paper introduces DivideMix, which combines approaches from LNL and SSL. One unique thing about DivideMix is that it discards sample labels that are highly likely to be noisy and leverages these noisy samples as unlabeled data instead. This prevents the model from overfitting and improves generalization performance. Key contributions of this work are:
1) Co-divide, which trains two networks simultaneously, aims to improve generalization and avoid confirmation bias.
2) During the SSL phase, an improvement is made on an existing method (MixMatch) by adding label co-refinement and co-guessing to account for label noise.
3) Significant improvements to state-of-the-art results on multiple conditions are experimentally shown while using DivideMix. Extensive ablation study and qualitative results are also shown to examine the effect of different components.

== Motivation ==

While much has been achieved in training DNNs with noisy labels and SSL methods individually, not much progress has been made in exploring their underlying connections and building on top of the two approaches simultaneously.

Existing LNL methods aim to correct the loss function by:
<ol>
<li> Treating all samples equally and correcting loss explicitly or implicitly through relabelling of the noisy samples
<li> Reweighting training samples or separating clean and noisy samples, which results in correction of the loss function
</ol>

A few examples of LNL methods include:
<ol>
<li> Estimating the noise transition matrix, which denotes the probability of clean labels flipping to noisy labels, to correct the loss function
<li> Leveraging the predictions from DNNs to correct labels and using them to modify the loss
<li> Reweighting samples so that noisy labels contribute less to the loss
</ol>

However, these methods all have downsides: it is very challenging to correctly estimate the noise transition matrix in the first method; for the second method, DNNs tend to overfit to datasets with high noise ratio; and for the third method, we need to be able to identify clean samples, which has also proven to be challenging.

On the other hand, SSL methods mostly leverage unlabeled data using regularization to improve model performance. A recently proposed method, MixMatch, combines three kinds of regularization: consistency regularization, which encourages the model to produce consistent predictions on augmented input data; entropy minimization, which encourages the model to give high-confidence predictions on unlabeled data; and MixUp regularization.

DivideMix partially adopts LNL in that it removes the labels that are highly likely to be noisy by using co-divide to avoid the confirmation bias problem. It then utilizes the noisy samples as unlabeled data and adopts an improved version of MixMatch (an SSL technique) which accounts for the label noise during the label co-refinement and co-guessing phase. By incorporating SSL techniques into LNL and taking the best of both worlds, DivideMix aims to produce highly promising results in training DNNs by better addressing the confirmation bias problem, more accurately distinguishing and utilizing noisy samples, and performing well under high levels of noise.

== Model Architecture and Algorithm ==

DivideMix leverages semi-supervised learning to achieve effective modeling. The training data is first split into a labeled set and an unlabeled set. This is achieved by fitting a two-component Gaussian Mixture Model to the per-sample loss distribution; samples that are more likely to belong to the low-loss component are treated as clean. The unlabeled set is made up of data points whose labels are deemed noisy and discarded. Then, to avoid confirmation bias, which is typical when a model is self-training, two models are trained simultaneously to filter errors for each other. This is done by dividing the data using one model and then training the other model on that division. This algorithm, known as Co-divide, keeps the two networks from converging to each other during training, which prevents the bias from occurring. Staying diverged also gives the two networks distinct abilities to filter different types of error, making the model more robust to noise. Figure 1 describes the algorithm in graphical form.

[[File:ModelArchitecture.PNG | center]]

<div align="center">Figure 1: Model Architecture of DivideMix</div>
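
The division step can be illustrated with a minimal sketch. It is not the authors' implementation: the EM routine, the function names <code>fit_two_component_gmm</code> and <code>co_divide</code>, and the use of a threshold <code>tau</code> on the clean probability are assumptions made for illustration, and the losses in the usage note are synthetic.

```python
import numpy as np

def fit_two_component_gmm(losses, n_iter=50, eps=1e-8):
    """Fit a 1-D two-component GMM by EM.

    Returns, for every sample, the posterior probability of belonging
    to the component with the smaller mean (the "clean" component).
    """
    x = np.asarray(losses, dtype=float)
    # Initialise the two means at the extremes of the observed losses.
    mu = np.array([x.min(), x.max()])
    var = np.full(2, x.var() + eps)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: log-density of each sample under each component.
        log_pdf = -0.5 * ((x[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
        log_resp = np.log(pi) + log_pdf
        log_resp -= log_resp.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances.
        nk = resp.sum(axis=0) + eps
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + eps
        pi = nk / nk.sum()
    clean_component = np.argmin(mu)
    return resp[:, clean_component]

def co_divide(losses, tau=0.5):
    """Split samples into labeled (clean) and unlabeled (noisy) sets.

    `losses` are per-sample losses from ONE network; the resulting split
    is meant to be used for training the OTHER network.
    """
    w = fit_two_component_gmm(losses)
    labeled_mask = w >= tau  # assumed thresholding rule
    return w, labeled_mask
```

For example, with synthetic losses where 80 samples have low loss and 20 have high loss, <code>co_divide</code> places the low-loss samples in the labeled set and the high-loss samples in the unlabeled set.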

For each epoch, each network divides the dataset into a labeled set consisting of clean data and an unlabeled set consisting of noisy data; this division is then used as training data for the other network, where training is done in mini-batches. For each batch of labeled samples, co-refinement linearly combines the ground-truth label <math> y_b </math> with the predicted label <math> p_b </math>, weighted by the clean probability <math> w_b </math>, i.e. the GMM posterior produced by the other network.

<center><math> \bar{y}_b = w_b y_b + (1-w_b) p_b </math></center>

Then, a sharpening function is implemented on this weighted sum to produce the estimate with reduced temperature, <math> \hat{y}_b </math>.

<center><math> \hat{y}^c_b = Sharpen(\bar{y}_b,T)^c = \frac{(\bar{y}^c_b)^{1/T}}{\sum_{c'=1}^C (\bar{y}^{c'}_b)^{1/T}} </math>, for <math>c = 1, 2, \ldots, C</math></center>
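
The co-refinement and sharpening formulas above can be sketched in a few lines of NumPy. The function names and the example values are illustrative assumptions; labels are taken to be one-hot rows and predictions to be rows of class probabilities.

```python
import numpy as np

def co_refine(y, p, w):
    """Blend labels y with predictions p using clean probabilities w.

    y, p: arrays of shape (batch, classes); w: array of shape (batch,).
    """
    w = np.asarray(w, dtype=float)[:, None]
    return w * y + (1.0 - w) * p

def sharpen(y_bar, T=0.5):
    """Raise each class probability to the power 1/T and renormalise."""
    powered = np.power(y_bar, 1.0 / T)
    return powered / powered.sum(axis=1, keepdims=True)

# Example: one sample, three classes, clean probability 0.5.
y = np.array([[1.0, 0.0, 0.0]])
p = np.array([[0.2, 0.7, 0.1]])
y_bar = co_refine(y, p, w=[0.5])  # rows stay on the probability simplex
y_hat = sharpen(y_bar, T=0.5)     # T < 1 pushes mass toward the largest class
```

With <code>T = 0.5</code> the exponent is 2, so the largest entry of the refined label gains probability mass while the row still sums to one.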

Using all these predicted labels, the unlabeled samples are then assigned a "co-guessed" label, which should produce a more accurate prediction. Having calculated all these labels, MixMatch is applied to the combined mini-batch of labeled data, <math> \hat{X} </math>, and unlabeled data, <math> \hat{U} </math>, where each pair of samples and their labels is mixed into one new sample and one new label. More specifically, for a pair of samples <math> (x_1,x_2) </math> and their labels <math> (p_1,p_2) </math>, the mixed sample <math> (x',p') </math> is:

<center>
<math>
\begin{alignat}{2}

\lambda &\sim Beta(\alpha, \alpha) \\
\lambda ' &= max(\lambda, 1 - \lambda) \\
x' &= \lambda ' x_1 + (1 - \lambda ' ) x_2 \\
p' &= \lambda ' p_1 + (1 - \lambda ' ) p_2 \\

\end{alignat}
</math>
</center>
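
A minimal sketch of this mixing step is given below. The function name <code>mix_pair</code> and the toy inputs are assumptions for illustration; only the four equations above are taken from the text.

```python
import numpy as np

def mix_pair(x1, p1, x2, p2, alpha=4.0, rng=None):
    """Mix two samples and their labels following the equations above."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    # Taking the max keeps the mixed sample closer to (x1, p1) than to (x2, p2).
    lam = max(lam, 1.0 - lam)
    x_mix = lam * x1 + (1.0 - lam) * x2
    p_mix = lam * p1 + (1.0 - lam) * p2
    return x_mix, p_mix, lam

# Toy example with 4-dimensional "images" and 2 classes.
rng = np.random.default_rng(0)
x1, p1 = np.ones(4), np.array([1.0, 0.0])
x2, p2 = np.zeros(4), np.array([0.0, 1.0])
x_mix, p_mix, lam = mix_pair(x1, p1, x2, p2, alpha=4.0, rng=rng)
```

Because <math> \lambda ' \geq 0.5 </math>, the first sample of each pair always dominates the mixture, so a mixed labeled sample stays mostly labeled and a mixed unlabeled sample stays mostly unlabeled.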

MixMatch transforms <math> \hat{X} </math> and <math> \hat{U} </math> into <math> X' </math> and <math> U' </math>. Then, the loss on <math> X' </math>, <math> L_X </math> (Cross-entropy loss) and the loss on <math> U' </math>, <math> L_U </math> (Mean Squared Error) are calculated. A regularization term, <math> L_{reg} </math>, is introduced to regularize the model's average output across all samples in the mini-batch. Then, the total loss is calculated as:

<center><math> L = L_X + \lambda_u L_U + \lambda_r L_{reg} </math></center>

where <math> \lambda_r </math> is set to 1, and <math> \lambda_u </math> is used to control the unsupervised loss.
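
The combination of the three terms can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: the regularizer is written here as the KL divergence between a uniform class prior and the mean prediction over the mini-batch, the MSE is averaged over all entries, and the function name and the value of <code>lambda_u</code> in the example are hypothetical.

```python
import numpy as np

def total_loss(probs_x, targets_x, probs_u, targets_u,
               lambda_u, lambda_r=1.0, eps=1e-12):
    """Compute L = L_X + lambda_u * L_U + lambda_r * L_reg.

    probs_*: model output probabilities, shape (batch, classes).
    targets_*: (mixed) target distributions of the same shape.
    """
    # L_X: cross-entropy between mixed labels and predictions on X'.
    l_x = -np.mean(np.sum(targets_x * np.log(probs_x + eps), axis=1))
    # L_U: mean squared error between targets and predictions on U'.
    l_u = np.mean((probs_u - targets_u) ** 2)
    # L_reg: pushes the batch-average prediction toward a uniform prior
    # (assumed form of the regularizer).
    mean_pred = np.concatenate([probs_x, probs_u]).mean(axis=0)
    prior = np.full(mean_pred.shape, 1.0 / mean_pred.size)
    l_reg = np.sum(prior * np.log(prior / (mean_pred + eps)))
    return l_x + lambda_u * l_u + lambda_r * l_reg

# Perfect, class-balanced predictions give a loss of (almost) zero.
perfect = np.eye(2)
loss_perfect = total_loss(perfect, perfect, perfect, perfect, lambda_u=25.0)

# Predictions collapsed onto one class are penalised by all three terms.
collapsed = np.array([[0.9, 0.1], [0.9, 0.1]])
loss_collapsed = total_loss(collapsed, perfect, collapsed, perfect, lambda_u=25.0)
```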

Lastly, the model parameters, <math> \boldsymbol{ \theta } </math>, are updated by stochastic gradient descent using the calculated loss, <math> L </math>.

The full algorithm is shown below. [[File:dividemix.jpg|600px| | center]]
<div align="center">Algorithm 1: DivideMix. Lines 4-8: co-divide; Lines 17-18: label co-refinement; Line 20: co-guessing.</div>

During the warm-up, the model is trained on all data using the standard cross-entropy loss so that it initially converges, but with an added negative entropy term <math>-\mathcal{H}</math> as a regularizer, where <math>\mathcal{H} = -\sum_{c}\text{p}^\text{c}_\text{model}(x;\theta)\log(\text{p}^\text{c}_\text{model}(x;\theta))</math> is the entropy of the model's prediction and <math>\text{p}^\text{c}_\text{model}</math> is the softmax output probability for class c. This term penalizes confident predictions and thereby prevents overfitting to noise during the warm-up, which can happen when there is asymmetric noise.
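
As a rough sketch of the warm-up objective, the entropy of each prediction can be subtracted from the cross-entropy. The unit weight on the penalty and the function name <code>warmup_loss</code> are assumptions for illustration.

```python
import numpy as np

def warmup_loss(probs, labels, penalty=True, eps=1e-12):
    """Cross-entropy on the given labels, optionally plus the term -H.

    probs: softmax outputs, shape (batch, classes).
    labels: integer class labels, shape (batch,).
    """
    n = probs.shape[0]
    ce = -np.mean(np.log(probs[np.arange(n), labels] + eps))
    # H is the entropy of each prediction, averaged over the batch.
    entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    # Adding -H lowers the loss for high-entropy (less confident) outputs.
    return ce - entropy if penalty else ce

labels = np.array([0])
confident = np.array([[0.99, 0.01]])
soft = np.array([[0.70, 0.30]])

# Size of the reduction granted by the penalty for each prediction.
gain_confident = warmup_loss(confident, labels, False) - warmup_loss(confident, labels, True)
gain_soft = warmup_loss(soft, labels, False) - warmup_loss(soft, labels, True)
```

The softer prediction receives the larger reduction, which is the sense in which the term discourages over-confident fitting of possibly wrong labels.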

== Results ==
'''Applications'''

The method was validated using four benchmark datasets: CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), which each contain 50K training images and 10K test images of size 32 × 32, Clothing1M (Xiao et al., 2015), and WebVision (Li et al., 2017a).
Two types of label noise are used in the experiments: symmetric and asymmetric.
An 18-layer PreAct ResNet (He et al., 2016) is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 300 epochs. The initial learning rate is set to 0.02 and reduced by a factor of 10 after 150 epochs. Before applying the Co-divide and MixMatch strategies, the models are first independently trained over the entire dataset using cross-entropy loss during a "warm-up" period. Training the models in this way initially prepares a more regular distribution of losses to improve upon in subsequent epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100. For all CIFAR experiments, the same hyperparameters are used: M = 2, T = 0.5, and α = 4. τ is set to 0.5, except for the 90% noise ratio, where it is set to 0.6.
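
The training schedule described above can be written down as a small configuration sketch. The dictionary keys, function names, and the zero-based epoch indexing are assumptions for illustration; the numeric values are the ones quoted in the text.

```python
# Hyperparameters quoted in the text for the CIFAR experiments.
CONFIG = {
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "batch_size": 128,
    "epochs": 300,
    "base_lr": 0.02,
    "lr_drop_epoch": 150,
    "lr_drop_factor": 10.0,
    "warmup_epochs": {"cifar10": 10, "cifar100": 30},
}

def learning_rate(epoch, cfg=CONFIG):
    """Step schedule: base rate, divided by the drop factor from the drop epoch on.

    Epochs are assumed to be counted from zero.
    """
    if epoch >= cfg["lr_drop_epoch"]:
        return cfg["base_lr"] / cfg["lr_drop_factor"]
    return cfg["base_lr"]

def in_warmup(epoch, dataset, cfg=CONFIG):
    """True while the networks are still trained with plain cross-entropy."""
    return epoch < cfg["warmup_epochs"][dataset]
```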

'''Comparison of State-of-the-Art Methods'''

The effectiveness of DivideMix was shown by comparing the test accuracy with the most recent state-of-the-art methods:
Meta-Learning (Li et al., 2019) proposes a gradient-based method to find model parameters that are more noise-tolerant;
Joint-Optim (Tanaka et al., 2018) and P-correction (Yi & Wu, 2019) jointly optimize the sample labels and the network parameters;
M-correction (Arazo et al., 2019) models the per-sample loss with a beta mixture model (BMM) and applies MixUp.
The following are the results on CIFAR-10 and CIFAR-100 with different levels of symmetric label noise ranging from 20% to 90%. Both the best test accuracy across all epochs and the averaged test accuracy over the last 10 epochs were recorded in the following table:

[[File:divideMixtable1.PNG | center]]

From Table 1, the authors observe that none of the baseline methods consistently outperforms the others across datasets: M-correction excels at symmetric noise, whereas Meta-Learning performs better for asymmetric noise. DivideMix outperforms the state-of-the-art methods by a large margin across all noise ratios. The improvement is substantial (∼10% in accuracy) for the more challenging CIFAR-100 with high noise ratios.

DivideMix was also compared with the state-of-the-art methods on the other two datasets, Clothing1M and WebVision. The results show that DivideMix consistently outperforms state-of-the-art methods across all datasets with different types of label noise. For WebVision, DivideMix achieves more than 12% improvement in top-1 accuracy.

'''Ablation Study'''

The authors study the effect of removing different components to provide insight into what makes DivideMix successful. The results in Table 5 are analyzed as follows.

[[File:DivideMixtable5.PNG | center]]

The authors combined self-divide with the original MixMatch as a naive baseline for using SSL in LNL.
They also find that both label refinement and input augmentation are beneficial for DivideMix. ''Label refinement'' is important at high noise ratios because more noisy samples are mistakenly divided into the labeled set. ''Augmentation'' improves model performance by creating more reliable predictions and by achieving consistency regularization. In addition, the performance drop seen for ''DivideMix w/o co-training'' highlights the disadvantage of self-training; the model still has dataset division, label refinement and label guessing, but they are all performed by the same model.

== Conclusion ==

This paper provides a new and effective algorithm for learning with noisy labels by treating highly noisy samples as unlabelled data in a semi-supervised learning framework. The DivideMix method trains two networks simultaneously and uses label co-refinement and co-guessing effectively, making it a robust approach to dealing with noise in datasets. DivideMix has also been tested on various datasets through extensive experiments, with results consistently among the best when compared to state-of-the-art methods.

Future work on DivideMix includes adapting it to other applications such as Natural Language Processing, and further incorporating ideas from SSL and LNL into the DivideMix architecture.

== Critiques/ Insights ==

1. While combining both models improves the results, the authors did not report the relative increase in training time for the combined methodology, which is crucial when training on a large amount of data, especially images. In addition, it seems that the authors did not do much hyperparameter tuning for the combined model.

2. There is an interesting insight, which is when the noise ratio increases from 80% to 90%, the accuracy of DivideMix drops dramatically in both datasets.

3. There should be a further explanation of why the learning rate drops by a factor of 10 after 150 epochs.

4. It would be interesting to see the effectiveness of this method in other domains such as NLP. I am not aware of noisy training datasets available in NLP, but surely this is an important area to focus on, as much of the available data is collected from noisy sources from the web.

5. The paper implicitly assumes that a Gaussian mixture model (GMM) is sufficiently capable of identifying noise. Given the nature of a GMM, it would work well for noise that is distributed by a Gaussian distribution but for all other noise, it would probably be only asymptotic. The paper should present theoretical results on the noise that are Exponential, Rayleigh, etc. This is particularly important because the experiments were done on massive datasets, but they do not directly address the case when there are not many data points.

6. Comparing the training result on these benchmark datasets makes the algorithm quite comprehensive. This is a very insightful idea to maintain two networks to avoid bias from occurring.

7. The current benchmark accuracy for CIFAR-10 is 99.7 and for CIFAR-100 is 96.08, using EffNet-L2 in 2020. In 2019, the accuracy for CIFAR-10 was 99.37 and for CIFAR-100 was 93.51, using BiT-L (based on paperswithcode.com). As better methods exist, it would be nice to know why the authors chose these state-of-the-art methods to compare test accuracy.

8. Another interesting observation is that DivideMix seems to maintain a similar accuracy while some methods give unstable results. That shows the reliability of the proposed algorithm.

9. It would be interesting to see if the drop in accuracy from increasing the noise ratio to 90% is a result of a low proportion or a low number of clean labels. That is, would increasing the size of the training set but keeping the noise ratio at 90% result in increased accuracy?

10. For Ablation Study part, the paper also introduced a study on the Robustness of Testing Marking Methods Noise, including AUC for classification of clean/noisy samples of CIFAR-10 training data. And it shows that the method can effectively separate clean and noisy samples as training proceeds.

11. It is interesting how, unlike common methods, the method in this paper discards the labels that are highly likely to be noisy. It also utilizes the noisy samples as unlabeled data to regularize training in an SSL manner. This model can better distinguish and utilize noisy samples.

12. In the results section, the authors give a comprehensive understanding of this algorithm by introducing its applications and comparing it with similar methods. It would be helpful if, in the applications part, the authors could indicate how the method relates to our daily life.

13. High-quality data is very important for training machine learning systems. Preparing the data to train ML systems requires data annotations, which are prone to errors and time-consuming to produce. It is interesting to note how paper 14 and this paper approach this problem from different perspectives. Paper 14 introduces the CSL algorithm, which learns from confused or noisy data to find the tasks associated with them, while this paper proposes an algorithm that performs well when learning from noisy data. Hence both papers tackle a similar problem, and combining the approaches described in both when handling noisy data could be doubly helpful.

14. Noise exists in all big data, and big data is what we are dealing with in real life nowadays. Having an effective noise eliminating method such as Dividemix is important to us.

15. The DivideMix consistently outperforms state-of-the-art methods across the given datasets, but how about some other potential datasets? If it can be given that it has advantages for a certain type of potential dataset, it will be a better discussion.

== References ==
Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Unsupervised
label noise modeling and loss correction. In ICML, pp. 312–321, 2019.

David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin
Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019.

Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach
to learning from noisy labels. In WACV, pp. 1215–1224, 2018.

User:Yktan

2020-12-01T02:42:58Z

Y87yu: /* Critiques/ Insights */

== Introduction ==

Much of the success in training deep neural networks (DNNs) is due to the collection of large datasets with human-annotated labels. However, human annotation is both a time-consuming and expensive task, especially for data that requires expertise such as medical data. Furthermore, certain datasets will be noisy due to the biases introduced by different annotators. Data obtained in large quantities through searching for images in search engines and data downloaded from social media sites (in a manner abiding by privacy and copyright laws) are especially noisy, since the labels are generally inferred from tags to save on human-annotation cost.

There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. Other LNL methods estimate a noise transition matrix and employ it to correct the loss function. An example of a popular loss correction approach is the bootstrapping loss approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples. The main limitation of these methods is that they do not perform well under high noise ratio and cause overfitting.

This paper introduces DivideMix, which combines approaches from LNL and SSL. One unique thing about DivideMix is that it discards sample labels that are highly likely to be noisy and leverages these noisy samples as unlabeled data instead. This prevents the model from overfitting and improves generalization performance. Key contributions of this work are:
1) Co-divide, which trains two networks simultaneously, aims to improve generalization and avoid confirmation bias.
2) During the SSL phase, an improvement is made on an existing method (MixMatch) by adding label co-refinement and co-guessing to account for label noise.
3) Significant improvements to state-of-the-art results on multiple conditions are experimentally shown while using DivideMix. Extensive ablation study and qualitative results are also shown to examine the effect of different components.

== Motivation ==

While much has been achieved in training DNNs with noisy labels and SSL methods individually, not much progress has been made in exploring their underlying connections and building on top of the two approaches simultaneously.

Existing LNL methods aim to correct the loss function by:
<ol>
<li> Treating all samples equally and correcting loss explicitly or implicitly through relabelling of the noisy samples
<li> Reweighting training samples or separating clean and noisy samples, which results in correction of the loss function
</ol>

A few examples of LNL methods include:
<ol>
<li> Estimating the noise transition matrix, which denotes the probability of clean labels flipping to noisy labels, to correct the loss function
<li> Leveraging the predictions from DNNs to correct labels and using them to modify the loss
<li> Reweighting samples so that noisy labels contribute less to the loss
</ol>

However, these methods all have downsides: it is very challenging to correctly estimate the noise transition matrix in the first method; for the second method, DNNs tend to overfit to datasets with high noise ratio; and for the third method, we need to be able to identify clean samples, which has also proven to be challenging.

On the other hand, SSL methods mostly leverage unlabeled data using regularization to improve model performance. A recently proposed method, MixMatch, combines three kinds of regularization: consistency regularization, which encourages the model to produce consistent predictions on augmented input data; entropy minimization, which encourages the model to give high-confidence predictions on unlabeled data; and MixUp regularization.

DivideMix partially adopts LNL in that it removes the labels that are highly likely to be noisy by using co-divide to avoid the confirmation bias problem. It then utilizes the noisy samples as unlabeled data and adopts an improved version of MixMatch (an SSL technique) which accounts for the label noise during the label co-refinement and co-guessing phase. By incorporating SSL techniques into LNL and taking the best of both worlds, DivideMix aims to produce highly promising results in training DNNs by better addressing the confirmation bias problem, more accurately distinguishing and utilizing noisy samples, and performing well under high levels of noise.

== Model Architecture and Algorithm ==

DivideMix leverages semi-supervised learning to achieve effective modeling. The training data is first split into a labeled set and an unlabeled set. This is achieved by fitting a two-component Gaussian Mixture Model to the per-sample loss distribution; samples that are more likely to belong to the low-loss component are treated as clean. The unlabeled set is made up of data points whose labels are deemed noisy and discarded. Then, to avoid confirmation bias, which is typical when a model is self-training, two models are trained simultaneously to filter errors for each other. This is done by dividing the data using one model and then training the other model on that division. This algorithm, known as Co-divide, keeps the two networks from converging to each other during training, which prevents the bias from occurring. Staying diverged also gives the two networks distinct abilities to filter different types of error, making the model more robust to noise. Figure 1 describes the algorithm in graphical form.

[[File:ModelArchitecture.PNG | center]]

<div align="center">Figure 1: Model Architecture of DivideMix</div>

For each epoch, each network divides the dataset into a labeled set consisting of clean data and an unlabeled set consisting of noisy data; this division is then used as training data for the other network, where training is done in mini-batches. For each batch of labeled samples, co-refinement linearly combines the ground-truth label <math> y_b </math> with the predicted label <math> p_b </math>, weighted by the clean probability <math> w_b </math>, i.e. the GMM posterior produced by the other network.

<center><math> \bar{y}_b = w_b y_b + (1-w_b) p_b </math></center>

Then, a sharpening function is implemented on this weighted sum to produce the estimate with reduced temperature, <math> \hat{y}_b </math>.

<center><math> \hat{y}^c_b = Sharpen(\bar{y}_b,T)^c = \frac{(\bar{y}^c_b)^{1/T}}{\sum_{c'=1}^C (\bar{y}^{c'}_b)^{1/T}} </math>, for <math>c = 1, 2, \ldots, C</math></center>

Using all these predicted labels, the unlabeled samples are then assigned a "co-guessed" label, which should produce a more accurate prediction. Having calculated all these labels, MixMatch is applied to the combined mini-batch of labeled data, <math> \hat{X} </math>, and unlabeled data, <math> \hat{U} </math>, where each pair of samples and their labels is mixed into one new sample and one new label. More specifically, for a pair of samples <math> (x_1,x_2) </math> and their labels <math> (p_1,p_2) </math>, the mixed sample <math> (x',p') </math> is:

<center>
<math>
\begin{alignat}{2}

\lambda &\sim Beta(\alpha, \alpha) \\
\lambda ' &= max(\lambda, 1 - \lambda) \\
x' &= \lambda ' x_1 + (1 - \lambda ' ) x_2 \\
p' &= \lambda ' p_1 + (1 - \lambda ' ) p_2 \\

\end{alignat}
</math>
</center>

MixMatch transforms <math> \hat{X} </math> and <math> \hat{U} </math> into <math> X' </math> and <math> U' </math>. Then, the loss on <math> X' </math>, <math> L_X </math> (Cross-entropy loss) and the loss on <math> U' </math>, <math> L_U </math> (Mean Squared Error) are calculated. A regularization term, <math> L_{reg} </math>, is introduced to regularize the model's average output across all samples in the mini-batch. Then, the total loss is calculated as:

<center><math> L = L_X + \lambda_u L_U + \lambda_r L_{reg} </math></center>

where <math> \lambda_r </math> is set to 1, and <math> \lambda_u </math> is used to control the unsupervised loss.

Lastly, the model parameters, <math> \boldsymbol{ \theta } </math>, are updated by stochastic gradient descent using the calculated loss, <math> L </math>.

The full algorithm is shown below. [[File:dividemix.jpg|600px| | center]]
<div align="center">Algorithm 1: DivideMix. Lines 4-8: co-divide; Lines 17-18: label co-refinement; Line 20: co-guessing.</div>

When the model is warmed up, it is trained on all data using standard cross-entropy loss so that it initially converges, together with a regularizing negative entropy term <math>-\mathcal{H}</math>, where <math>\mathcal{H} = -\sum_{c}\text{p}^\text{c}_\text{model}(x;\theta)\log(\text{p}^\text{c}_\text{model}(x;\theta))</math> and <math>\text{p}^\text{c}_\text{model}</math> is the softmax output probability for class c. This term penalizes confident predictions during warm-up, which prevents overfitting to noise; such overfitting can happen when the label noise is asymmetric.
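As a rough illustration of the warm-up objective, the sketch below adds the negative entropy of the softmax output to the cross-entropy loss. The function names and the equal weighting of the two terms are assumptions made for this example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def warmup_loss(logits, labels, eps=1e-12):
    """Cross-entropy plus a negative entropy penalty (assumed weight of 1)."""
    p = softmax(logits)
    n = logits.shape[0]
    ce = -np.log(p[np.arange(n), labels] + eps).mean()
    entropy = -(p * np.log(p + eps)).sum(axis=1).mean()
    # Subtracting the entropy penalizes confident (low-entropy) predictions.
    return ce - entropy

# Toy check: uniform predictions have maximal entropy, so the two terms cancel.
uniform_loss = warmup_loss(np.zeros((2, 3)), np.array([0, 1]))
```

A confident prediction lowers the cross-entropy only if it is correct, but it always lowers the entropy, so the penalty discourages the network from fitting noisy labels too early.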

== Results ==
'''Applications'''

The method was validated using four benchmark datasets: CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), which both contain 50K training images and 10K test images of size 32 × 32, Clothing1M (Xiao et al., 2015), and WebVision (Li et al., 2017a).
Two types of label noise are used in the experiments: symmetric and asymmetric.
An 18-layer PreAct Resnet (He et al., 2016) is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 300 epochs. The initial learning rate was set to 0.02 and reduced by a factor of 10 after 150 epochs. Before applying the Co-divide and MixMatch strategies, the models were first independently trained over the entire dataset using cross-entropy loss during a "warm-up" period. Initially, training the models in this way prepares a more regular distribution of losses to improve upon in subsequent epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100. For all CIFAR experiments, we use the same hyperparameters M = 2, T = 0.5, and α = 4. τ is set as 0.5 except for 90% noise ratio when it is set as 0.6.

'''Comparison of State-of-the-Art Methods'''

The effectiveness of DivideMix was shown by comparing the test accuracy with the most recent state-of-the-art methods:
Meta-Learning (Li et al., 2019) proposes a gradient-based method to find model parameters that are more noise-tolerant;
Joint-Optim (Tanaka et al., 2018) and P-correction (Yi & Wu, 2019) jointly optimize the sample labels and the network parameters;
M-correction (Arazo et al., 2019) models the sample loss with a beta mixture model (BMM) and applies MixUp.
The following are the results on CIFAR-10 and CIFAR-100 with different levels of symmetric label noise ranging from 20% to 90%. Both the best test accuracy across all epochs and the averaged test accuracy over the last 10 epochs were recorded in the following table:

[[File:divideMixtable1.PNG | center]]

From Table 1, the authors observe that none of the baseline methods consistently outperforms the others across different datasets. M-correction excels at symmetric noise, whereas Meta-Learning performs better for asymmetric noise. DivideMix outperforms state-of-the-art methods by a large margin across all noise ratios. The improvement is substantial (∼10% of accuracy) for the more challenging CIFAR-100 with high noise ratios.

DivideMix was also compared with the state-of-the-art methods on the other two datasets: Clothing1M and WebVision. The results show that DivideMix consistently outperforms state-of-the-art methods across all datasets with different types of label noise. For WebVision, DivideMix achieves more than 12% improvement in top-1 accuracy.

'''Ablation Study'''

The effect of removing different components is studied to provide insights into what makes DivideMix successful. The results in Table 5 are analyzed as follows.

[[File:DivideMixtable5.PNG | center]]

The authors combined self-divide with the original MixMatch as a naive baseline for using SSL in LNL.
They also find that both label refinement and input augmentation are beneficial for DivideMix. ''Label refinement'' is important for high noise ratios because more noisy samples would be incorrectly divided into the labeled set. ''Augmentation'' improves model performance by creating more reliable predictions and by achieving consistency regularization. In addition, the performance drop seen in ''DivideMix w/o co-training'' highlights the disadvantage of self-training; the model still has dataset division, label refinement and label guessing, but they are all performed by the same model.

== Conclusion ==

This paper provides a new and effective algorithm for learning with noisy labels by treating highly noisy samples as unlabelled data in a semi-supervised learning framework. The DivideMix method trains two networks simultaneously and utilizes co-guessing and co-labeling effectively, which makes it a robust approach for dealing with noise in datasets. In addition, the DivideMix method has been tested extensively on various datasets, and its results are consistently among the best when compared to the state-of-the-art methods.

Future work on DivideMix includes adapting it to other applications such as Natural Language Processing and incorporating further ideas from SSL and LNL into the DivideMix architecture.

== Critiques/ Insights ==

1. While combining both models improves the results, the authors did not report the relative increase in training time for the combined methodology, which is crucial when training on a large amount of data, especially images. In addition, it seems that the authors did not do much hyperparameter tuning for the combined model.

2. There is an interesting insight, which is when the noise ratio increases from 80% to 90%, the accuracy of DivideMix drops dramatically in both datasets.

3. There should be a further explanation of why the learning rate drops by a factor of 10 after 150 epochs.

4. It would be interesting to see the effectiveness of this method in other domains such as NLP. I am not aware of noisy training datasets available in NLP, but surely this is an important area to focus on, as much of the available data is collected from noisy sources from the web.

5. The paper implicitly assumes that a Gaussian mixture model (GMM) is sufficiently capable of identifying noise. Given the nature of a GMM, it would work well for noise that is distributed by a Gaussian distribution but for all other noise, it would probably be only asymptotic. The paper should present theoretical results on the noise that are Exponential, Rayleigh, etc. This is particularly important because the experiments were done on massive datasets, but they do not directly address the case when there are not many data points.

6. Comparing the training result on these benchmark datasets makes the algorithm quite comprehensive. This is a very insightful idea to maintain two networks to avoid bias from occurring.

7. The current benchmark accuracy for CIFAR-10 is 99.7, CIFAR-100 is 96.08 using EffNet-L2 in 2020. In 2019, CIFAR-10 is 99.37, CIFAR-100 is 93.51 using BiT-L.(based on paperswithcode.com) As there exists better methods, it would be nice to know why the authors chose these state-of-the-art methods to compare the test accuracy.

8. Another interesting observation is that DivideMix seems to maintain a similar accuracy while some methods give unstable results. That shows the reliability of the proposed algorithm.

9. It would be interesting to see if the drop in accuracy from increasing the noise ratio to 90% is a result of a low proportion or a low number of clean labels. That is, would increasing the size of the training set while keeping the noise ratio at 90% result in increased accuracy?

10. For Ablation Study part, the paper also introduced a study on the Robustness of Testing Marking Methods Noise, including AUC for classification of clean/noisy samples of CIFAR-10 training data. And it shows that the method can effectively separate clean and noisy samples as training proceeds.

11. It is interesting how, unlike common methods, the method in this paper discards the labels that are highly likely to be noisy. It also utilizes the noisy samples as unlabeled data to regularize training in an SSL manner. This model can better distinguish and utilize noisy samples.

12. In the results section, the authors give a comprehensive understanding of this algorithm by introducing the applications and comparing it with similar methods. It would be attractive if, in the application part, the authors could indicate how the application relates to our daily life.

13. High-quality data is very important for training machine learning systems. Preparing the data requires annotations, which are prone to errors and time-consuming to produce. It is interesting to note how paper 14 and this paper approach this problem from different perspectives. Paper 14 introduces the CSL algorithm, which learns from confused or noisy data to find the tasks associated with them, while this paper proposes an algorithm that performs well when learning from noisy data. Both papers therefore tackle a similar problem, and combining their approaches when handling noisy data could be doubly helpful.

14. Noise exists in all big data, and big data is what we are dealing with in real life nowadays. Having an effective noise-eliminating method such as DivideMix is important to us.

15. DivideMix consistently outperforms state-of-the-art methods across the given datasets, but how about other potential datasets? If it can be shown that it has advantages for a certain type of dataset, the discussion would be stronger.

== References ==
Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Unsupervised label noise modeling and loss correction. In ICML, pp. 312–321, 2019.

David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019.

Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach to learning from noisy labels. In WACV, pp. 1215–1224, 2018.

Semantic Relation Classification——via Convolution Neural Network

2020-12-01T02:35:19Z

Y87yu: /* References */

== Presented by ==
Rui Gong, Xinqi Ling, Di Ma, Xuetong Wang

== Introduction ==
A Semantic Relation can imply a relation between different words and a relation between different sentences or phrases. For example, the pair of words "white" and "snowy" can be synonyms, while "white" and "black" can be opposites. It can be used for recommendation systems like YouTube, and understanding sentiment analysis. The study of semantic analysis involves determining the exact meaning of a text. For example, the word "date", can have different meanings in different contexts, like a calendar "date", a "date" fruit, or a romantic "date".

One of the emerging trends of natural language technologies is their use for the humanities and sciences (Gábor et al., 2018). SemEval 2018 Task 7 mainly solves the problem of relation extraction and classification of two entities in the same sentence into 6 potential relations. The 6 relations are USAGE, RESULT, MODEL-FEATURE, PART WHOLE, TOPIC, and COMPARE.

SemEval 2018 Task 7 extracted data from 350 scientific paper abstracts, which have 1228 and 1248 annotated sentences for the two tasks, respectively. For each data point, an example sentence was chosen with its right and left sentences, as well as an indicator showing whether the relation is reversed, and then a prediction is made.

Three models were used for the prediction: Linear Classifiers, Long Short-Term Memory(LSTM), and Convolutional Neural Networks (CNN). Linear Classifier achieves the goal of classification by making a classification decision based on the value of a linear combination of the characteristics. LSTM is an artificial recurrent neural network (RNN) architecture well suited to classifying, processing and making predictions based on time series data. In the end, the prediction based on the CNN model was ultimately submitted since it performed the best among all models. By using the learned custom word embedding function, the research team added a variant of negative sampling, thereby improving performance and surpassing ordinary CNN.

== Previous Work ==
SemEval 2010 Task 8 (Hendrickx et al., 2010) explored the classification of natural language relations and studied the 9 relations between word pairs. However, it is not designed for scientific text analysis, and their challenge differs from the challenge of this paper in its generalizability; this paper’s relations are specific to ACL papers (e.g. MODEL-FEATURE), whereas the 2010 relations are more general, and might necessitate more common-sense knowledge than the 2018 relations. Xu et al. (2015a) and Santos et al. (2015) both applied CNN with negative sampling to finish task7. The 2017 SemEval Task 10 also featured relation extraction within scientific publications.

== Algorithm ==

[[File:CNN.png|800px|center]]

This is the architecture of the CNN. We first transform a sentence via feature embeddings. Word representations are encoded by the column vectors of the embedding matrix <math> W^{word} \in \mathbb{R}^{d^w \times |V|}</math>, where <math>V</math> is the vocabulary of the dataset. The <math>i^{th}</math> column is the word embedding vector for the <math>i^{th}</math> word in the vocabulary. This matrix is trainable during the optimization process and is initialized by pre-trained embedding vectors. Basically, we transform each sentence into continuous word embeddings:

$$
(e^{w_i})
$$

And word position embeddings:
$$
(e^{wp_i}): e_i = [e^{w_i}, e^{wp_i}]
$$

In the word embeddings, we generated a vocabulary <math> V </math>. We will then generate an embedding word matrix based on the position of the word in the vocabulary. This matrix is trainable and needs to be initialized by pre-trained embedding vectors such as through GloVe or Word2Vec.

In the word position embeddings, we first need to input some words named ‘entities’; they are the key for the machine to determine the sentence’s relation. If we have two entities, we use their relative positions in the sentence to make the embeddings. We output two vectors: one of them keeps track of each word's position relative to the first entity (the entity itself is recorded as 0, the word before it as -1, the word after it as 1, etc.), and the other does the same for the second entity. Finally, the two vectors are concatenated to form the position embedding. For example, in the sentence "the black '''cat''' jumped", the position embedding of "'''cat'''" is -2,-1,0,1.
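The relative-position indexing described above can be sketched as follows. This is a minimal illustration in which the helper names are hypothetical; in the actual model these integer offsets would be looked up in a trainable position-embedding table.

```python
def relative_positions(tokens, entity_index):
    """Offset of every token from the entity (the entity itself is 0)."""
    return [i - entity_index for i in range(len(tokens))]

def position_features(tokens, first_entity, second_entity):
    """Pair the offsets to both entities, one (offset1, offset2) tuple per token."""
    first = relative_positions(tokens, first_entity)
    second = relative_positions(tokens, second_entity)
    return list(zip(first, second))

tokens = ["the", "black", "cat", "jumped"]
positions = relative_positions(tokens, tokens.index("cat"))
pairs = position_features(tokens, first_entity=1, second_entity=3)
```

Here <code>positions</code> reproduces the "cat" example from the text, and <code>pairs</code> shows how two entities give each word two offsets.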

After the embeddings, the model transforms the embedded sentence into a fixed-size representation of the whole sentence via the convolution layer. Finally, after max-pooling reduces the dimension of the layer outputs, we get a score for each relation class via a linear transformation.

After featurizing all words, a sentence of length <math> N </math> can be expressed as a vector of length <math> N </math>,
$$e=[e_{1},e_{2},\ldots,e_{N}]$$
where each entry is the representation of one token. To apply the convolutional neural network, windows of consecutive features
$$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$
are multiplied by a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to produce a new feature, defined as
$$c_{i}=\text{tanh}(W\cdot e_{i:i+k-1}+bias)$$
This process is applied to every window of length <math> k </math>, starting from the first one, which produces a mapped feature vector:
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$

Max pooling is then applied, and <math> \hat{c}=max\{c\} </math> is kept as the feature for that filter. Different weight filters yield different mapped feature vectors. In this way, the original sentence <math> e </math> is converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters, then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.

Then, the score vector
$$s(x)=W^{classes}r_{x}$$
is obtained, which holds a score for each class. The relation between the entities of <math> x </math> is classified as the class with the highest score. The matrix <math> W^{classes} </math> is learned during training.
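The convolution, max-pooling, and scoring steps can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions: the shapes, the random weights, and the function names are made up for the example, and each filter is stored as a <math>k \times d</math> array (the transpose of the layout used in the text).

```python
import numpy as np

def sentence_representation(E, filters, biases):
    """E: (N, d) token embeddings; filters: (F, k, d); biases: (F,).

    Returns r_x of shape (F,): one max-pooled feature per filter.
    Assumes the sentence has at least k tokens.
    """
    N, _ = E.shape
    _, k, _ = filters.shape
    # All windows e_{i:i+k-1}, stacked into shape (N-k+1, k, d)
    windows = np.stack([E[i:i + k] for i in range(N - k + 1)])
    # c[i, f] = tanh(<filter f, window i> + bias f)
    c = np.tanh(np.einsum("wkd,fkd->wf", windows, filters) + biases)
    return c.max(axis=0)  # max pooling over window positions

def classify(r_x, W_classes):
    """Score every class and return the index of the highest score."""
    s = W_classes @ r_x
    return int(np.argmax(s)), s

# Hypothetical sizes: 7 tokens, 6-dim embeddings, 5 filters of width 3, 6 classes
rng = np.random.default_rng(0)
E = rng.normal(size=(7, 6))
filters = rng.normal(size=(5, 3, 6))
biases = np.zeros(5)
W_classes = rng.normal(size=(6, 5))

r_x = sentence_representation(E, filters, biases)
label, scores = classify(r_x, W_classes)
```

Whatever the sentence length, <code>r_x</code> has one entry per filter, which is what makes the representation fixed-size.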

To improve performance, "negative sampling" was used. Consider a training data point <math> \tilde{x} </math> with correct class <math> \tilde{y} </math>, and let <math> I=Y\setminus\{\tilde{y}\} </math> represent the incorrect labels for <math> \tilde{x} </math>. Basically, the distance between the correct score and the positive margin, and the negative distance (negative margin plus the largest score among the incorrect classes), should be minimized. So the loss function is
$$L=\log(1+e^{\gamma(m^{+}-s(\tilde{x})_{\tilde{y}})})+\log(1+e^{\gamma(m^{-}+\mathtt{max}_{y'\in I}(s(\tilde{x})_{y'}))})$$
with margins <math> m^{+} </math>, <math> m^{-} </math>, and penalty scale factor <math> \gamma </math>.
The whole training is based on the ACL Anthology corpus, which contains 25,938 papers with 136,772,370 tokens in total, 49,600 of which are unique.
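The ranking loss can be sketched as below, following the textual description (negative margin plus the largest incorrect score). The function name and the default values of the margins and scale factor are assumptions chosen only for illustration; they are not taken from this summary.

```python
import numpy as np

def ranking_loss(scores, correct, m_pos=2.5, m_neg=0.5, gamma=2.0):
    """Pairwise ranking loss over one example's class scores.

    scores: 1-D array of class scores s(x); correct: index of the true class.
    m_pos, m_neg, gamma are illustrative defaults (assumed, not from the text).
    """
    best_wrong = np.delete(scores, correct).max()  # hardest negative class
    pos_term = np.log1p(np.exp(gamma * (m_pos - scores[correct])))
    neg_term = np.log1p(np.exp(gamma * (m_neg + best_wrong)))
    return pos_term + neg_term

# The loss is small when the true class scores high and the rest score low.
loss_good = ranking_loss(np.array([3.0, -1.0, -2.0]), correct=0)
loss_bad = ranking_loss(np.array([-1.0, 3.0, -2.0]), correct=0)
```

Only the highest-scoring incorrect class contributes to the second term, so each update focuses on the most competitive wrong label.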

== Results ==
Hyper-parameter tuning is an important part of machine learning. Beyond traditional hyper-parameter optimization, several modifications were made to the model to increase performance on the test set. Five modifications can be applied:

'''1.''' Merged Training Sets. The two training sets were combined to increase the size of the data set, which also improves the balance between classes and leads to better predictions.

'''2.''' Reversal Indicator Features. A binary feature was added to indicate whether the relation is reversed.

'''3.''' Custom ACL Embeddings. Word vectors were trained on an ACL-specific corpus.

'''4.''' Context Words. The size of the context window around the entity-enclosed text within the sentence was varied.

'''5.''' Ensembling. Different early stopping points and random initializations were used to improve the predictions.

These modifications perform well on the training data, and the results are shown in Table 3.

[[File:table3.PNG|center]]

As we can see, the best choice for this model is ensembling, as the random initializations made the results more natural and helped avoid overfitting.
During the training process, some methods only increased the score on the cross-validation test sets while hurting the overall macro-F1 score. These methods were therefore eventually ruled out.

[[File:table4.PNG|center]]

There are six submissions in total, three for each training set, and the results are shown in figure 2.

The best submission for training set 1.1 is the third submission, which does not use a separate cross-validation dataset. Instead, a constant number of training epochs is run, with cross-validation based on the training data.

In the other submissions, by comparison, 10% of the training data is removed to form the validation set (both with and without stratification). Predictions for these are made when the model has the highest validation accuracy.

The best submission for training set 1.2 is the one that extracted 10% of the training data to form the validation dataset. Predictions are made when maximum accuracy is reached on the validation data.

All in all, early stopping cannot always be based on the accuracy of the validation set, since this does not guarantee better performance on the real test set. Thus, new approaches have to be tried and combined to see how the predictions change. In addition, stratification appears to improve performance on the test data.

== Conclusions ==
Throughout the process, we have experimented with linear classifiers, sequential random forest, LSTM, and CNN models with various variations applied, such as two models of attention, negative sampling, entity embedding or sentence-only embedding, etc.

Among all variations, the vanilla CNN with negative sampling and ACL embeddings, without attention, performs significantly better than all others. Attention-based pooling, up-sampling, and data augmentation were also tested, but they yielded barely any improvement.

== Critiques ==

- Applying this in news apps might be beneficial to improve readability by highlighting specific important sections.

- The data set comes from 350 scientific papers. The authors could explain in more detail how those papers were selected and why they are important to discuss.

- In the section of previous work, the author mentioned 9 natural language relationships between the word pairs. Among them, 6 potential relationships are USAGE, RESULT, MODEL-FEATURE, PART WHOLE, TOPIC, and COMPARE. It would help the readers to better understand if all 9 relationships are listed in the summary.

- This topic is interesting, and this application might help some educational websites guide readers to focus on the important points. I think it would be nice to use LaTeX to type the equations inline rather than centering each equation on the next line. I also think it would be interesting to discuss applying this method to other languages such as Chinese, Japanese, etc.

- It would be a good idea if the authors can provide more details regarding ACL Embeddings and Context words modifications. Scores generated using these two modifications are quite close to the highest Ensembling modification generated score, which makes it a valid consideration to examine these two modifications in detail.

- This paper is dealing with a similar problem as 'Neural Speed Reading Via Skim-RNN', num 19 paper summary. It will be an interesting approach to compare these two models' performance based on the same dataset.

- I think it would be highly practical to implement this system as a page-rank system for search engines (such as google, bing, or other platforms like Facebook, Instagram, etc.) by finding the most prevalent information available in a search query and then matching the search to the related text which can be found on webpages. This could also be implemented in search bars on specific websites or locations as well.

- It would be interesting to see in the future how the model would behave if data not already trained was used. This pre-trained data as mentioned in the paper had noise included. Using cleaner data would give better results maybe.

- The selection of the training dataset, i.e. the abstracts of scientific papers, is an excellent idea since the abstracts usually contain more information than the body. But it may be also a good idea to train the model with the conclusions. Other than that, the result of applying the model to the body part of the scientific papers may show some interesting features of the model.

- From Table 4 we find that comparing with using a fixed number of training periods, early stopping based on the accuracy of the validation set does not guarantee better test set performance. The label ratio of the validation set is layered according to the training set, which helps to improve the performance of the test set. Whether it is beneficial to add entity embedding as an additional feature could be an interesting point of discussion.

- The author mentioned the use of CNNs for contextual understanding. NLP models based on CNNs have sometimes been inadequate for the use of understanding the context of words used in a body of text, being very prone to overfitting the context of the training data. It would help if more evidence was shown as to how the paper deals with this problem.

- Deep neural network with complex structure and huge parameter set is good at fitting the model. However, overfitting is a problem. Some strategies can be discussed like dropout strategies. Also, the choice of the most appropriate number of hidden layers is related to many factors like the scale of training corpus. This can also be talked about in the paper.

== References ==
[1] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[2] Dragomir R. Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. 2013. The ACL anthology network corpus. Language Resources and Evaluation, pages 1–26.

[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[4] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

[5] Kata Gábor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Haïfa Zargayouna, and Thierry Charnois. 2018. SemEval-2018 Task 7: Semantic relation extraction and classification in scientific papers. In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.

Surround Vehicle Motion Prediction

2020-12-01T02:30:00Z

Y87yu: /* Prediction performance analysis and application to motion planning */

'''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.

== Previous Work ==
There are 3 main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. To predict driver behavior on an urban road, motion prediction models fall into 3 categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, considering only the kinematic states of the predicted vehicles. Their advantage is that they have the smallest computational burden among the three types. However, they cannot consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classify it. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving makes it possible to infer different drivers' intentions, which is stated to improve the prediction accuracy. However, this still serves only as an aid to improve physics-based models. A Recurrent Neural Network (RNN) is the type of approach proposed in this paper to infer driver intention. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of detected vehicles simultaneously as a scene. However, their prediction algorithms are more computationally complex, so they are often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].

== Motivation ==
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. Together these form the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection. The following image gives a visual representation of the model.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
Multi-lane turn intersections are the target roads in this paper. The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples and 4,998 evaluation data samples.

=== Motion predictor ===
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
An RNN is an artificial neural network suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. It can also contain feedback loops that allow activations to flow alternately in the loop.
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading, and speed.

==== Encoder and decoder ====
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data, rescaling it to mean 0 and standard deviation 1, while the decoder denormalizes the output data, using the same parameters as the encoder, to scale it back to the actual units.
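A minimal sketch of such an encoder/decoder pair is shown below. The class name <code>FeatureScaler</code>, its methods, and the toy data are assumptions made for illustration; the paper's actual implementation is not shown in this summary.

```python
import numpy as np

class FeatureScaler:
    """Per-feature standardization (encode) and its inverse (decode)."""

    def fit(self, X):
        # X: (samples, features); statistics are computed per feature
        self.mean = X.mean(axis=0)
        self.std = X.std(axis=0)
        self.std[self.std == 0] = 1.0  # avoid division by zero for constant features
        return self

    def encode(self, X):
        return (X - self.mean) / self.std

    def decode(self, Z):
        # Uses the same parameters as encode to return to physical units
        return Z * self.std + self.mean

# Hypothetical data: 100 samples of (relative x, relative y, heading, speed)
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -2.0, 0.1, 8.0], scale=[3.0, 1.0, 0.05, 2.0], size=(100, 4))
scaler = FeatureScaler().fit(X)
Z = scaler.encode(X)
X_back = scaler.decode(Z)
```

Storing the mean and standard deviation once and reusing them in both directions ensures the decoded predictions are expressed in the same units as the sensor inputs.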
==== Sequence length ====
The sequence length of the RNN input and output is another important factor for prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps with a 100-millisecond sampling time were compared, and 15 steps showed relatively accurate results, even though the observation time is very short.

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q (x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta u}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta u}</math> are the weights for states, input, and input derivative, respectively. These weights were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&||u(k+1|t) - u(k|t)|| \leq S
\end{split}
\end{equation*}
The position and speed bounds are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = \min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where <math>v_{x, limit}</math> is the speed limit of the target vehicle.
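
To make the index ranges of the cost function concrete, the sketch below evaluates <math>J</math> for one candidate trajectory and checks the input constraints. It is an illustration under stated assumptions rather than the authors' controller: the function names, the scalar treatment of the input weights, and all numeric values are hypothetical, and no optimization is performed.

```python
def mpc_cost(states, refs, inputs, Q, R, R_du):
    """Evaluate J for one candidate trajectory.

    states, refs: lists of [p_x, v_x] for k = 1..N_p
    inputs: accelerations u(k|t) for k = 0..N_p-1
    Q: 2x2 state weight as nested lists; R, R_du: scalar weights
    """
    tracking = 0.0
    for x, x_ref in zip(states, refs):
        e = [x[0] - x_ref[0], x[1] - x_ref[1]]
        # Quadratic form e^T Q e
        tracking += sum(e[i] * Q[i][j] * e[j] for i in range(2) for j in range(2))
    effort = R * sum(u * u for u in inputs)
    # Differences u(k+1|t) - u(k|t) for k = 0..N_p-2
    smoothness = R_du * sum((b - a) ** 2 for a, b in zip(inputs, inputs[1:]))
    return tracking + effort + smoothness

def inputs_feasible(inputs, u_min, u_max, S):
    """Check the magnitude bound and the step-to-step change bound on the input."""
    in_range = all(u_min <= u <= u_max for u in inputs)
    change_ok = all(abs(b - a) <= S for a, b in zip(inputs, inputs[1:]))
    return in_range and change_ok

# Hypothetical horizon of N_p = 3 steps
states = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
refs = [[1.0, 5.0], [2.0, 6.0], [3.0, 6.0]]
inputs = [0.5, 0.5, 0.0]
J = mpc_cost(states, refs, inputs, Q=[[1.0, 0.0], [0.0, 2.0]], R=1.0, R_du=2.0)
```

In this example the tracking term contributes 4.0, the input term 0.5, and the input-change term 0.5, so <math>J = 5.0</math>.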

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with
constant velocity, a path-following model with traffic flow, and a CTRV model.

We compare those algorithms according to four kinds of error: the <math>x</math> position error <math>e_{x,T_p}</math>,
the <math>y</math> position error <math>e_{y,T_p}</math>, the heading error <math>e_{\theta,T_p}</math>, and the velocity error <math>e_{v,T_p}</math>, where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=& p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=& p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=& \theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=& v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center>

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean,
standard deviation (STD), and root mean square error (RMSE). Moreover, its error distribution is a bell-shaped
curve with a mean close to zero, which indicates that the proposed algorithm's predictions of human drivers'
intentions are relatively precise. In addition, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.
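
The three summary statistics used in this comparison can be computed from paired ground-truth and predicted values as in the sketch below. The function name and the sample values are hypothetical and are not taken from the paper's data.

```python
from math import sqrt
from statistics import mean, pstdev

def error_summary(actual, predicted):
    """Mean, STD, and RMSE of the error e = actual - predicted, plus the 3-sigma interval."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mu = mean(errors)
    sigma = pstdev(errors)
    rmse = sqrt(mean([e * e for e in errors]))
    return {
        "mean": mu,
        "std": sigma,
        "rmse": rmse,
        "three_sigma": (mu - 3 * sigma, mu + 3 * sigma),
    }

# Hypothetical lateral positions in metres at one prediction time
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [0.5, 2.5, 2.5, 4.5]
summary = error_summary(y_true, y_pred)
```

When the mean error is zero, as in this example, the RMSE equals the standard deviation. A non-zero mean would indicate a systematic bias in the predictor.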

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better, by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle, even when the target vehicle was not following the intersection guideline.

==== Statistical analysis of motion planning application results ====
The data are analyzed from two perspectives: the time to recognize the in-lane target and the similarity to
human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base
algorithm. The proposed algorithm recognized the target later than the base algorithm only when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. These cases
took place sufficiently beyond the safety distance and therefore had little influence on determining
the behaviour of the subject vehicle.

<center>[[Image:Figure11_YanYu.png|500px|]]</center>

To compare the results from the proposed algorithm with human driving decisions,
we introduce another type of error, the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm produced results more similar to human drivers’ decisions than the base
algorithms did. <math>91.97\%</math> of the acceleration error lies within <math>\pm 1 \text{ m/s}^2</math>. Moreover, the base algorithm
has a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed
model is efficient and safe.

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained using data captured on urban roads in Seoul, and its predictions were used in an MPC-based motion planner. The evaluation results showed precise prediction accuracy, which supports applying the algorithm safely on an autonomous vehicle. The comparison with the three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm. In addition, the time to recognize in-lane targets within the intersection improved significantly over the base algorithms. The proposed algorithm was also compared with human driving data, and it showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unconstructed environments, such as multi-lane turn intersections.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3. Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and studies on different types of roads. Why the LSTM-RNN is used and the background of the method are not stated clearly. The concepts are not clearly defined, which makes it difficult to distinguish between the LSTM-RNN based motion predictor and motion planning.

This is an interesting topic to discuss. It is a major topic for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers a complete view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. I think it would be interesting to discuss more applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare the performance of different algorithms, including traditional motion planning algorithms such as KF.

There are some papers that discuss the accuracy of different models in vehicle trajectory prediction, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions [https://arxiv.org/pdf/1908.00219.pdf]. In that work, the LSTM did not show good performance. The authors increased the accuracy by combining the LSTM with an unconstrained model (UM), adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.

It may be better to provide experimental results that support the efficiency of the LSTM-RNN, discuss prediction performance on the training and test sets, and compare it with other existing autonomous driving systems.

The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.

It would be really helpful if some visualization or data summary can be provided to understand the content, such as the track of the car movement.

The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].

[6] J. Schulz, C. Hubmann, N. Morin, J. Löchner, and D. Burschka, “Learning interaction-aware probabilistic driver behavior models from urban scenarios,” 2019, doi: 10.1109/IVS.2019.8814080.

Surround Vehicle Motion Prediction

2020-12-01T02:29:49Z

Y87yu: /* Prediction performance analysis and application to motion planning */

DROCC: '''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. It describes a data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. Motion prediction models for driver behavior on urban roads fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, and only consider the states of the predicted vehicles kinematically. Their advantage is that they have the smallest computational burden of the three types. However, they cannot account for interactions between vehicles. Maneuver-based models consider the driver’s intention and classify it. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is stated to improve prediction accuracy. However, this approach still only assists physics-based models. The Recurrent Neural Network (RNN) is the approach proposed in this paper to infer driver intention. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, their prediction algorithms are computationally more complex, so they are often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].

== Motivation ==
Little research has focused on predicting vehicle trajectories at intersections. Moreover, public datasets for analyzing driver behavior at intersections are scarce and not easy to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The proposed architecture uses a perception algorithm, which relies on six scanners, to estimate the state of surrounding vehicles. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection. The following image depicts the architecture of the surrounding target trajectory predictor.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
Multi-lane turn intersections are the target roads in this paper. The dataset was captured on urban roads in Seoul. The training model is built from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted from each track. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training samples and 4,998 evaluation samples.
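
The reported counts correspond to a 70/30 split, since 11,662 / 16,660 = 0.70. The sketch below reproduces these counts; how the authors actually assigned samples to each set is not stated in this summary, so the seeded random shuffle and the function name are assumptions.

```python
import random

def split_samples(samples, train_percent=70, seed=0):
    """Shuffle a copy of the samples and split it into training and evaluation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the split point.
    n_train = len(shuffled) * train_percent // 100
    return shuffled[:n_train], shuffled[n_train:]

# Stand-in sample identifiers; the real samples are state sequences.
train, evaluation = split_samples(range(16660))
```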

=== Motion predictor ===
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
An RNN is an artificial neural network suited to sequential data, including time-series data in which the pattern depends on the flow of time. It can contain feedback loops that allow activations to circulate through the network.
An LSTM avoids the vanishing gradient problem by letting errors flow backward without a limit on the number of virtual layers. This property keeps errors from growing or decaying over time, either of which can make the network train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was selected by comparing the accuracy of 72 RNNs, formed by combining four input sets with 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
To apply the motion predictor to a moving AV, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of the relative X/Y position, relative heading angle, and speed of the surrounding target vehicles, together with the speed of the data collection vehicle. The output sequence contains the same target-vehicle quantities as the input: relative position, heading, and speed.

==== Encoder and decoder ====
In this study, the authors introduced an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data to mean 0 and standard deviation 1, while the decoder denormalizes the output data, using the same parameters as the encoder, to scale it back to the actual units.
==== Sequence length ====
The sequence length of the RNN input and output is another important factor in prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps at a 100 millisecond sampling time were compared, and 15 steps showed relatively accurate results even though its observation time (1.5 seconds) is short compared with the longer candidates.
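
With a 100 millisecond sampling time, a 15-step sequence covers 1.5 seconds. The sketch below shows one way a recorded track could be cut into input and target windows of that length. It is an illustration only: the function name, the equal input and output lengths, and the stride of one step are assumptions, not details confirmed by the paper.

```python
def make_windows(track, n_in=15, n_out=15):
    """Split one track, sampled every 100 ms, into (past, future) training pairs."""
    pairs = []
    for start in range(len(track) - n_in - n_out + 1):
        past = track[start:start + n_in]
        future = track[start + n_in:start + n_in + n_out]
        pairs.append((past, future))
    return pairs

# Hypothetical 4-second track: 40 steps, each a placeholder state index
track = list(range(40))
pairs = make_windows(track)
```

Longer sequences give the network more history but reduce the number of windows each track can provide, which is one reason the choice of length is a trade-off.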

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q (x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta u}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta u}</math> are the weights for states, input, and input derivative, respectively. These weights were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&||u(k+1|t) - u(k|t)|| \leq S
\end{split}
\end{equation*}
The position and speed bounds are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = \min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where <math>v_{x, limit}</math> is the speed limit of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with
constant velocity, a path-following model with traffic flow, and a CTRV model.

We compare those algorithms according to four kinds of error: the <math>x</math> position error <math>e_{x,T_p}</math>,
the <math>y</math> position error <math>e_{y,T_p}</math>, the heading error <math>e_{\theta,T_p}</math>, and the velocity error <math>e_{v,T_p}</math>, where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=& p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=& p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=& \theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=& v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center>

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean,
standard deviation (STD), and root mean square error (RMSE). Moreover, its error distribution is a bell-shaped
curve with a mean close to zero, which indicates that the proposed algorithm's predictions of human drivers'
intentions are relatively precise. In addition, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.
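
One of the base algorithms in this comparison is a CTRV model. Assuming CTRV denotes the standard constant turn rate and velocity kinematic model, a physics-based baseline of this kind can be sketched as follows. The function names, the 15-step horizon, and the 0.1 second step are illustrative choices, not the configuration used in the paper.

```python
from math import sin, cos, pi

def ctrv_step(x, y, heading, speed, yaw_rate, dt):
    """Advance one step while holding speed and yaw rate constant."""
    if abs(yaw_rate) < 1e-6:
        # Straight-line limit avoids dividing by a near-zero yaw rate.
        x_next = x + speed * cos(heading) * dt
        y_next = y + speed * sin(heading) * dt
    else:
        radius = speed / yaw_rate
        x_next = x + radius * (sin(heading + yaw_rate * dt) - sin(heading))
        y_next = y + radius * (cos(heading) - cos(heading + yaw_rate * dt))
    return x_next, y_next, heading + yaw_rate * dt, speed

def ctrv_rollout(state, yaw_rate, steps=15, dt=0.1):
    """Predict a trajectory by repeating the CTRV step from an initial state."""
    x, y, heading, speed = state
    path = []
    for _ in range(steps):
        x, y, heading, speed = ctrv_step(x, y, heading, speed, yaw_rate, dt)
        path.append((x, y, heading, speed))
    return path

# Hypothetical vehicle at the origin, heading along x at 10 m/s, turning left
path = ctrv_rollout((0.0, 0.0, 0.0, 10.0), yaw_rate=0.2)
```

Because the model extrapolates only from the current kinematic state, it cannot anticipate a change of intention, which is the kind of limitation a learned predictor is meant to address.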

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better, by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle, even when the target vehicle was not following the intersection guideline.

==== Statistical analysis of motion planning application results ====
The data are analyzed from two perspectives: the time to recognize the in-lane target and the similarity to
human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base
algorithm. The proposed algorithm recognized the target later than the base algorithm only when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. These cases
took place sufficiently beyond the safety distance and therefore had little influence on determining
the behaviour of the subject vehicle.

<center>[[Image:Figure11_YanYu.png|500px|]]</center>

To compare the results from the proposed algorithm with human driving decisions,
we introduce another type of error, the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm produced results more similar to human drivers’ decisions than the base
algorithms did. <math>91.97\%</math> of the acceleration error lies within <math>\pm 1 \text{ m/s}^2</math>. Moreover, the base algorithm
has a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed
model is efficient and safe.
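
The share of acceleration errors inside a band, such as the figure quoted above, can be computed as in this sketch. The function name and the sample accelerations are hypothetical and do not reproduce the paper's data.

```python
def share_within(a_human, a_cmd, bound=1.0):
    """Fraction of samples whose acceleration error a_human - a_cmd lies within +/- bound."""
    errors = [h - c for h, c in zip(a_human, a_cmd)]
    return sum(abs(e) <= bound for e in errors) / len(errors)

# Hypothetical longitudinal accelerations in m/s^2
a_human = [0.2, -0.5, 1.5, 0.0]
a_cmd = [0.0, 0.3, 0.2, -0.4]
share = share_within(a_human, a_cmd)
```

Here three of the four errors fall inside the band, so the share is 0.75.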

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained using data captured on urban roads in Seoul, and its predictions were used in an MPC-based motion planner. The evaluation results showed precise prediction accuracy, which supports applying the algorithm safely on an autonomous vehicle. The comparison with the three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm. In addition, the time to recognize in-lane targets within the intersection improved significantly over the base algorithms. The proposed algorithm was also compared with human driving data, and it showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unconstructed environments, such as multi-lane turn intersections.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3. Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and studies on different types of roads. Why the LSTM-RNN is used and the background of the method are not stated clearly. The concepts are not clearly defined, which makes it difficult to distinguish between the LSTM-RNN based motion predictor and motion planning.

This is an interesting topic to discuss. It is a major topic for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers a complete view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. I think it would be interesting to discuss more applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare the performance of different algorithms, including traditional motion planning algorithms such as KF.

There are some papers that discuss the accuracy of different models in vehicle trajectory prediction, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions [https://arxiv.org/pdf/1908.00219.pdf]. In that work, the LSTM did not show good performance. The authors increased the accuracy by combining the LSTM with an unconstrained model (UM), adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.

It may be better to provide experimental results that support the efficiency of the LSTM-RNN, discuss prediction performance on the training and test sets, and compare it with other existing autonomous driving systems.

The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.

It would be really helpful if some visualization or data summary can be provided to understand the content, such as the track of the car movement.

The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].

[6] J. Schulz, C. Hubmann, N. Morin, J. Löchner, and D. Burschka, “Learning interaction-aware probabilistic driver behavior models from urban scenarios,” 2019, doi: 10.1109/IVS.2019.8814080.

Surround Vehicle Motion Prediction

2020-12-01T02:28:17Z

Y87yu: /* Prediction performance analysis and application to motion planning */

DROCC: '''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. It describes a data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. Motion prediction models for driver behavior on urban roads fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, and only consider the states of the predicted vehicles kinematically. Their advantage is that they have the smallest computational burden of the three types. However, they cannot account for interactions between vehicles. Maneuver-based models consider the driver’s intention and classify it. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is stated to improve prediction accuracy. However, this approach still only assists physics-based models. The Recurrent Neural Network (RNN) is the approach proposed in this paper to infer driver intention. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, their prediction algorithms are computationally more complex, so they are often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].

== Motivation ==
Relatively little research has focused on trajectory prediction at intersections. Moreover, public data sets for analyzing driver behavior at intersections are scarce, and such data are not easy to collect. A model is therefore needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The proposed architecture uses a perception algorithm, which relies on six scanners, to estimate the state of surrounding vehicles. The predictor outputs the future state of the surrounding vehicles, which is used to determine the desired longitudinal acceleration in actual traffic at the intersection. The following image depicts the architecture of the surrounding target trajectory predictor.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
Multi-lane turn intersections are the target roads in this paper. The dataset was captured on urban roads in Seoul. The model was trained on 484 tracks collected while driving through intersections in real traffic. From each track, the previous and subsequent states of a vehicle at a particular time can be extracted. Post-processing the collected data yielded a total of 16,660 data samples: 11,662 training samples and 4,998 evaluation samples.

=== Motion predictor ===
The paper proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The LSTM-RNN-based motion predictor in this work uses only information collected from sensors on the autonomous vehicle, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
An RNN is an artificial neural network suited to sequential data, including time-series data in which the pattern depends on the flow of time. It contains feedback loops that allow activations from earlier time steps to flow back into the network.
An LSTM avoids the vanishing gradient problem by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from exploding or vanishing over time, which would otherwise make the network train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was selected by comparing the accuracy of 72 RNNs, formed by combining four input sets with 18 network configurations.
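
For reference, the standard LSTM cell update can be written as below. This is the common textbook formulation and not necessarily the exact variant used in the paper; the symbols (input <math>x_t</math>, hidden state <math>h_t</math>, cell state <math>c_t</math>, weights <math>W</math>, <math>U</math> and biases <math>b</math>) are introduced here only for illustration.
\begin{equation*}
\begin{split}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{split}
\end{equation*}
The additive update of <math>c_t</math> is what allows the error signal to pass backward over many time steps without being repeatedly squashed, which is the property described above.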

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
To apply the motion predictor to an AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of the relative X/Y position, relative heading angle, and speed of the surrounding target vehicles, together with the speed of the data collection vehicle. The output sequence contains the same target-vehicle features as the input: relative position, heading, and speed.

==== Encoder and decoder ====
In this study, the authors introduced an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data, rescaling it to mean 0 and standard deviation 1, while the decoder denormalizes the output data, using the same parameters as the encoder, to scale it back to the actual units.
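
The encoder and decoder described above amount to z-score scaling and its inverse. The following is a minimal sketch under that reading; the class name <code>FeatureScaler</code> and the sample speeds are hypothetical and do not come from the paper.

```python
from statistics import mean, pstdev


class FeatureScaler:
    """Z-score encoder/decoder for a single feature (illustrative sketch)."""

    def __init__(self, training_values):
        # Parameters are estimated once and shared by encode() and decode().
        self.mu = mean(training_values)
        # Fall back to 1.0 so a constant feature does not cause division by zero.
        self.sigma = pstdev(training_values) or 1.0

    def encode(self, value):
        """Rescale a raw value to zero mean and unit standard deviation."""
        return (value - self.mu) / self.sigma

    def decode(self, value):
        """Map a normalized value back to the original unit."""
        return value * self.sigma + self.mu


# Hypothetical target-vehicle speeds in m/s (made-up numbers).
speeds = [2.0, 4.0, 6.0, 8.0]
scaler = FeatureScaler(speeds)
encoded = [scaler.encode(v) for v in speeds]
decoded = [scaler.decode(z) for z in encoded]
```

Keeping the parameters in one object makes it hard to decode with different statistics than those used for encoding.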
==== Sequence length ====
The sequence length of the RNN input and output is another important factor in prediction performance. In this study, sequence lengths of 5, 10, 15, 20, 25, and 30 steps, each with a 100 millisecond sampling time, were compared, and 15 steps showed relatively accurate results, even though the observation time is very short.
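
As a rough illustration of how a recorded track could be cut into fixed-length input/output sequences, the sketch below uses a sliding window. It assumes that the input and output sequences both have 15 steps; the function name <code>make_windows</code> and the toy track are hypothetical.

```python
SAMPLE_TIME_S = 0.1  # 100 ms sampling time, as stated in the text


def make_windows(track, n_in=15, n_out=15):
    """Split one track (a list of per-step states) into (past, future) pairs."""
    pairs = []
    for start in range(len(track) - n_in - n_out + 1):
        past = track[start:start + n_in]
        future = track[start + n_in:start + n_in + n_out]
        pairs.append((past, future))
    return pairs


# Toy track of 40 steps; each "state" is just its step index here.
track = list(range(40))
pairs = make_windows(track)
horizon_s = 15 * SAMPLE_TIME_S  # duration covered by a 15-step sequence
```

With 15 steps at 100 ms, each sequence covers about 1.5 seconds of motion.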

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers predict possible risks from observations of surrounding vehicles and ensure safety by changing their behavior before the risks occur. To achieve a human-like motion plan, a prediction-based motion planner for autonomous vehicles is designed using the model predictive control (MPC) method, taking into account the predicted future behavior of surrounding vehicles. The cost function of the motion planner is defined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q (x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta u}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta u}</math> are the weight matrices for states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles.
The constraints on the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&|u(k+1|t) - u(k|t)| \leq S
\end{split}
\end{equation*}
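
To make the cost function and input constraints above concrete, here is a small sketch that evaluates them for given trajectories. It assumes a diagonal <math>Q</math> and scalar input weights; the function names and all numbers are hypothetical and are not the tuned values from the paper.

```python
def mpc_cost(states, refs, inputs, q_diag, r, r_delta):
    """Evaluate J for given trajectories.

    states, refs: lists of (p_x, v_x) for k = 1..N_p
    inputs:       list of acceleration commands for k = 0..N_p-1
    q_diag:       (q_p, q_v), the diagonal of Q (diagonal Q is an assumption)
    """
    tracking = sum(
        q_diag[0] * (x[0] - xr[0]) ** 2 + q_diag[1] * (x[1] - xr[1]) ** 2
        for x, xr in zip(states, refs)
    )
    effort = r * sum(u ** 2 for u in inputs)
    # Penalize changes between consecutive inputs (k = 0..N_p-2).
    smoothness = r_delta * sum((b - a) ** 2 for a, b in zip(inputs, inputs[1:]))
    return tracking + effort + smoothness


def inputs_feasible(inputs, u_min, u_max, s):
    """Check the magnitude and rate constraints on the control input."""
    in_range = all(u_min <= u <= u_max for u in inputs)
    rate_ok = all(abs(b - a) <= s for a, b in zip(inputs, inputs[1:]))
    return in_range and rate_ok


# Toy example with N_p = 2 (made-up numbers).
states = [(1.0, 2.0), (3.0, 2.0)]
refs = [(1.0, 1.0), (2.0, 2.0)]
inputs = [1.0, 0.0]
cost = mpc_cost(states, refs, inputs, q_diag=(1.0, 2.0), r=0.5, r_delta=2.0)
```

An actual MPC solver would minimize this cost over the inputs subject to the vehicle model; the sketch only evaluates it for fixed trajectories.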
The position and speed bounds are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = \min(v_{x,ref}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where <math>v_{x, limit}</math> is the speed limit of the target vehicle.
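
The bounds above can be computed step by step along the prediction horizon. The sketch below assumes that <math>p_{x,tar}</math> is the predicted position of the target vehicle and <math>c_{des}</math> is a desired clearance to it; this interpretation, the function name, and the numbers are assumptions rather than details confirmed by the summary.

```python
def state_bounds(p_tar, c_des, v_ref, v_limit):
    """Return per-step upper bounds (p_max, v_max); lower bounds are 0."""
    # Stay at least c_des behind the predicted target position.
    p_max = [p - c for p, c in zip(p_tar, c_des)]
    # Never exceed the velocity term or the speed limit, whichever is lower.
    v_max = [min(v, v_limit) for v in v_ref]
    return p_max, v_max


# Toy two-step horizon (made-up numbers, metres and m/s).
p_max, v_max = state_bounds(
    p_tar=[30.0, 32.0],
    c_des=[10.0, 10.0],
    v_ref=[12.0, 16.0],
    v_limit=13.9,
)
```

Because the position bound depends on the predicted target motion, a better prediction directly changes how much room the planner believes it has.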

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with
constant velocity, a path-following model with traffic flow, and a constant turn rate and velocity (CTRV) model.
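
For readers unfamiliar with the CTRV baseline, the sketch below shows one prediction step of the generic CTRV motion model. It is the textbook form and not necessarily the exact implementation used in the paper; the function name and arguments are hypothetical.

```python
import math


def ctrv_step(x, y, heading, v, yaw_rate, dt):
    """Advance a CTRV state by dt seconds; returns (x, y, heading)."""
    if abs(yaw_rate) < 1e-9:
        # Straight-line limit, avoiding division by a near-zero yaw rate.
        return (
            x + v * math.cos(heading) * dt,
            y + v * math.sin(heading) * dt,
            heading,
        )
    new_heading = heading + yaw_rate * dt
    radius = v / yaw_rate  # turning radius of the circular arc
    return (
        x + radius * (math.sin(new_heading) - math.sin(heading)),
        y + radius * (math.cos(heading) - math.cos(new_heading)),
        new_heading,
    )


straight = ctrv_step(0.0, 0.0, 0.0, 10.0, 0.0, 1.0)
quarter_turn = ctrv_step(0.0, 0.0, 0.0, 1.0, math.pi / 2, 1.0)
```

The model keeps speed and yaw rate fixed over the horizon, which is why it cannot react to a target that changes its behavior partway through a turn.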

<center>[[Image:Figure10.1_YanYu.png|500px|]]</center>

The algorithms are compared using four types of error: the <math>x</math> position error <math>e_{x,T_p}</math>, the
<math>y</math> position error <math>e_{y,T_p}</math>, the heading error <math>e_{\theta,T_p}</math>, and the velocity error <math>e_{v,T_p}</math>, where the subscript <math>T_p</math> denotes the prediction time. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=& p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=& p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=& \theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=& v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}

<center>[[Image:Figure10.2_YanYu.png|500px|]] [[Image:Figure10.3_YanYu.png|500px|]]</center>

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean,
standard deviation (STD), and root mean square error (RMSE). Moreover, the error distribution of the proposed model
is a bell-shaped curve with a mean close to zero, which indicates that the proposed algorithm's predictions of human
drivers' intentions are relatively precise. In addition, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.
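
The summary statistics used in this comparison can be computed as in the sketch below. It assumes that hatted quantities are predictions and unhatted ones are measurements; the function name and the sample values are hypothetical.

```python
import math


def error_stats(actual, predicted):
    """Return (mean, STD, RMSE) of the errors actual - predicted."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mean_err = sum(errors) / n
    std_err = math.sqrt(sum((e - mean_err) ** 2 for e in errors) / n)
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    return mean_err, std_err, rmse


# Made-up lateral positions in metres.
actual_y = [1.0, 2.0, 3.0, 4.0]
predicted_y = [0.0, 2.0, 4.0, 4.0]
mean_err, std_err, rmse = error_stats(actual_y, predicted_y)
three_sigma = 3 * std_err  # half-width of the three-sigma range
```

Since the squared RMSE equals the squared mean plus the squared STD, a near-zero mean means the RMSE is dominated by the spread of the errors.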

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver more closely by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle, even when the target vehicle was not following the intersection guideline.

==== Statistical analysis of motion planning application results ====
The data are analyzed from two perspectives: the time to recognize the in-lane target and the similarity to
human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base
algorithm. Moreover, the only cases in which the proposed algorithm recognized the target later than the base
algorithm were those in which the surrounding target vehicles first appeared beyond the sensors’ region of interest
boundaries. These cases therefore took place sufficiently beyond the safety distance and had little influence on
determining the behaviour of the subject vehicle.

<center>[[Image:Figure11_YanYu.png|500px|]]</center>

To compare the results from the proposed algorithm with human driving decisions,
another type of error is introduced, the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm produced results more similar to human drivers’ decisions than the base
algorithms did: <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed
model is efficient and safe.
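
A figure such as the share of acceleration errors within <math>\pm 1 m/s^2</math> can be computed as in the following sketch. The function name and the sample accelerations are hypothetical and do not reproduce the reported 91.97%.

```python
def share_within(human_acc, cmd_acc, tol=1.0):
    """Fraction of samples whose acceleration error is within +/- tol."""
    errors = [h - c for h, c in zip(human_acc, cmd_acc)]
    return sum(abs(e) <= tol for e in errors) / len(errors)


# Made-up accelerations in m/s^2.
human = [0.5, -1.0, 2.0, 0.0]
command = [0.2, -0.5, 0.5, 0.1]
share = share_within(human, command)
```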

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained using data captured on urban roads in Seoul, and the predictor was applied to an MPC-based motion planner. The evaluation results showed precise prediction accuracy, which supports the safe application of the algorithm on an autonomous vehicle. The comparison with the three base algorithms (CV/Path, V_flow/Path, and CTRV) also revealed the superiority of the proposed algorithm. In addition, the time to recognize in-lane targets within the intersection improved significantly over that of the base algorithms. The proposed algorithm was compared with human driving data and showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unstructured environments, such as multi-lane turn intersections.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3. Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and studies on different types of roads. Why the LSTM-RNN is used and the background of the method are not stated clearly. Key concepts are not explained well enough, so it is difficult to distinguish between the LSTM-RNN-based motion predictor and motion planning.

This is an interesting topic to discuss, and a major one for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers a full view of what the model looks like. Since the model uses an LSTM-RNN, including some pictures of the LSTM-RNN would be great. I think it would also be interesting to discuss further applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare its performance with that of different algorithms or traditional motion planning algorithms such as the Kalman filter (KF).

Some papers have discussed the accuracy of different models for vehicle trajectory prediction, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions [https://arxiv.org/pdf/1908.00219.pdf]. There, the LSTM did not show good performance. The authors increased the accuracy by combining the LSTM with an unconstrained model (UM), adding an additional LSTM layer of size 128 that recursively outputs positions instead of outputting positions for all horizons simultaneously.

It may be better to provide experimental results to support the efficiency of the LSTM-RNN, discuss the predictions on the training and test sets, and compare it with other existing autonomous driving systems.

The topic of surround vehicle motion prediction is closely related to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.

It would be really helpful if some visualizations or data summaries were provided to aid understanding, such as the tracks of the car movements.

The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

[5] H. Cui, T. Nguyen, F.-C. Chou, T.-H. Lin, J. Schneider, D. Bradley, and N. Djuric, “Deep kinematic models for kinematically feasible vehicle trajectory predictions,” 2019, [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].

[6] J. Schulz, C. Hubmann, N. Morin, J. Löchner, and D. Burschka, “Learning interaction-aware probabilistic driver behavior models from urban scenarios,” 2019, doi: 10.1109/IVS.2019.8814080.

Surround Vehicle Motion Prediction

2020-12-01T02:27:11Z

Y87yu: /* Statistical analysis of motion planning application results */

DROCC: '''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections was described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.

== Previous Work ==
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].

== Motivation ==
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection. The following image gives a visual representation of the model.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
Multi-lane turn intersections are the target roads in this paper. The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.

=== Motion predictor ===
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading, and speed.

==== Encoder and decoder ====
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit.
==== Sequence length ====
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\
&||\mu(k+1|t) - \mu(k|t)|| \leq S
\end{split}
\end{equation*}
Determine the position and speed boundary based on the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with the results from three base algorithms, a path-following model with
constant velocity, a path-following model with traffic flow and a CTRV model.

We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>,
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}
\end{split}
\end{equation*}

The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean,
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers'
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better, by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle, even when the target vehicle was not following the intersection guideline.

==== Statistical analysis of motion planning application results ====
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance, and had little influence on determining
the behaviour of the subject vehicle.

<center>[[Image:Figure11_YanYu.png|500px|]]</center>

In order to compare the similarities between the results form the proposed algorithm and human driving decisions,
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed
model is efficient and safe.

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm. The evaluation results showed precise prediction accuracy. In addition, the time-to-recognize in-lane targets within the intersection improved significantly over the performance of the base algorithms. The proposed algorithm was compared with human driving data, and it showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unconstructed environments, such as multi-lane turn intersections.

== Future works ==
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.

This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.

Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.

There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.

It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.

The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.

It would be really helpful if some visualization or data summary can be provided to understand the content, such as the track of the car movement.

The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

[5] H. Cui, T. Nguyen, F.-C. Chou, T.-H. Lin, J. Schneider, D. Bradley, and N. Djuric, “Deep kinematic models for kinematically feasible vehicle trajectory predictions,” 2019, [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].

[6] J. Schulz, C. Hubmann, N. Morin, J. Löchner, and D. Burschka, “Learning interaction-aware probabilistic driver behavior models from urban scenarios,” 2019, doi: 10.1109/IVS.2019.8814080.

File:Figure11 YanYu.png

2020-12-01T02:26:23Z

Y87yu:

File:Figure11.png

2020-12-01T02:25:52Z

Y87yu:

File:Figure10.3 YanYu.png

2020-12-01T02:25:39Z

Y87yu:

File:Figure10.2 YanYu.png

2020-12-01T02:25:20Z

Y87yu:

File:Figure10.1 YanYu.png

2020-12-01T02:24:58Z

Y87yu:

Surround Vehicle Motion Prediction

2020-12-01T02:20:51Z

Y87yu: /* Conclusion */

'''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. It describes a data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to handle complex vehicle motions in multi-lane turn intersections. The results show that the predictor shortens the time needed to recognize the leading vehicle and improves prediction accuracy.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. Motion prediction models for driver behavior on urban roads fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, and consider only the kinematic states of the predicted vehicles. Their advantage is that they have the lowest computational burden of the three types. However, they cannot account for interactions between vehicles. Maneuver-based models consider the driver’s intention and classify it. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is reported to improve prediction accuracy. However, these models still serve mainly as an aid to physics-based models. The Recurrent Neural Network (RNN) is the approach proposed in this paper to infer driver intention. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, their prediction algorithms are computationally more complex, so they are often used only in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create, as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].

== Motivation ==
Comparatively little research has focused on predicting vehicle trajectories at intersections. Moreover, public data sets for analyzing driver behavior at intersections are scarce and difficult to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. Together these form the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm, which relies on six scanners, to estimate the state of surrounding vehicles. The predicted state of the surrounding vehicles is then used to determine the expected longitudinal acceleration in actual traffic at the intersection. The following image gives a visual representation of the model.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
Multi-lane turn intersections are the target roads in this paper. The dataset was captured on urban roads in Seoul. The model was trained on 484 tracks collected while driving through intersections in real traffic. From each track, the previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, comprising 11,662 training samples and 4,998 evaluation samples.

=== Motion predictor ===
This paper proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The LSTM-RNN-based motion predictor in this work uses only information collected from sensors on the autonomous vehicle, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
An RNN is an artificial neural network suited to sequential data. It can also be used for time-series data, where the pattern of the data depends on the flow of time. It can contain feedback loops that allow activations to flow around the loop.
An LSTM avoids the vanishing gradient problem by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from growing or decaying over time, which would otherwise make the network train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, formed by combining four input sets with 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
To apply the motion predictor to an AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of the relative X/Y position, relative heading angle, and speed of the surrounding target vehicles, together with the speed of the data collection vehicle. The output sequence contains the same kinds of target-vehicle features: relative position, heading, and speed.

==== Encoder and decoder ====
In this study, the authors introduced an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data to mean 0 and standard deviation 1, while the decoder denormalizes the output data, using the same parameters as the encoder, to scale it back to physical units.
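
As a rough illustration of the encoder/decoder pair described above, the following Python sketch standardizes each feature to mean 0 and standard deviation 1 and then inverts the scaling with the same parameters. The class name <code>FeatureScaler</code>, its methods, and the sample values are hypothetical and are not taken from the paper; the population standard deviation is assumed because the summary does not say which form is used.

```python
from statistics import mean, pstdev


class FeatureScaler:
    """Hypothetical encoder/decoder: per-feature z-score scaling (an assumption)."""

    def fit(self, rows):
        # rows: list of feature vectors, e.g. [x, y, heading, speed]
        cols = list(zip(*rows))
        self.means = [mean(c) for c in cols]
        # Fall back to 1.0 when a feature is constant, to avoid dividing by zero.
        self.stds = [pstdev(c) or 1.0 for c in cols]
        return self

    def encode(self, row):
        # Encoder: rescale each component to mean 0 and standard deviation 1.
        return [(v - m) / s for v, m, s in zip(row, self.means, self.stds)]

    def decode(self, row):
        # Decoder: undo the scaling with the same parameters as the encoder.
        return [v * s + m for v, m, s in zip(row, self.means, self.stds)]


if __name__ == "__main__":
    # Made-up samples: (relative x in m, speed in m/s)
    scaler = FeatureScaler().fit([[0.0, 10.0], [2.0, 30.0]])
    normalized = scaler.encode([2.0, 30.0])
    print(normalized, scaler.decode(normalized))
```

Fitting the parameters once on the training data and reusing them in the decoder is what keeps the network output in the same physical units as the sensor input.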
==== Sequence length ====
The sequence length of the RNN input and output is another important factor in prediction performance. In this study, sequences of 5, 10, 15, 20, 25, and 30 steps with a 100-millisecond sampling time were compared, and 15 steps gave relatively accurate results even though the corresponding observation time is very short.
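
To make the role of the sequence length concrete, the sketch below cuts one recorded track into (past, future) training pairs with a sliding window. The function name <code>make_windows</code> and the choice of equal input and output lengths are assumptions made for illustration; the summary only states that 15 steps at a 100 ms sampling time worked well.

```python
def make_windows(track, n_in=15, n_out=15):
    """Split one track (a list of per-step states) into (past, future) pairs.

    n_in and n_out are counted in sampling steps; with 100 ms sampling,
    15 steps correspond to 1.5 s. Equal lengths are an assumption.
    """
    samples = []
    last_start = len(track) - n_in - n_out
    for start in range(last_start + 1):
        past = track[start:start + n_in]
        future = track[start + n_in:start + n_in + n_out]
        samples.append((past, future))
    return samples


if __name__ == "__main__":
    # A made-up track of 40 steps; each state is just its step index here.
    track = list(range(40))
    pairs = make_windows(track)
    print(len(pairs), pairs[0][0][-1], pairs[0][1][0])
```

Tracks shorter than <code>n_in + n_out</code> steps yield no samples, which is one reason a long sequence length reduces the amount of usable training data.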

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers predict possible risks based on observations of surrounding vehicles and ensure safety by changing their behavior before the risks occur. To achieve a human-like motion plan, a prediction-based motion planner for autonomous vehicles is designed using the model predictive control (MPC) method, which takes into account the predicted future behavior of surrounding vehicles. The cost function of the motion planner is defined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q (x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta u}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta u}</math> are the weight matrices for states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles.
The constraints on the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&||u(k+1|t) - u(k|t)|| \leq S
\end{split}
\end{equation*}
The position and speed boundaries are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = \min(v_{x,ref}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where <math>v_{x, limit}</math> is the speed limit of the target vehicle.
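
The cost function and input constraints above can be evaluated for a candidate plan as in the following sketch. It is only an illustration, not the authors' solver: the function names, the diagonal form assumed for <math>Q</math>, and all numeric values are hypothetical, and a real MPC would minimize this cost with an optimizer rather than merely evaluate it.

```python
def mpc_cost(x, x_ref, u, q_diag, r, r_delta):
    """Evaluate J for one candidate plan.

    x, x_ref: lists of (p_x, v_x) states for k = 1..N_p
    u:        list of acceleration commands for k = 0..N_p-1
    q_diag:   diagonal entries of Q (a diagonal Q is an assumption)
    """
    # State tracking term: (x - x_ref)^T Q (x - x_ref), summed over the horizon.
    tracking = sum(
        q * (xi - ri) ** 2
        for xk, rk in zip(x, x_ref)
        for q, xi, ri in zip(q_diag, xk, rk)
    )
    # Input magnitude term.
    effort = r * sum(uk ** 2 for uk in u)
    # Input change term over consecutive commands (k = 0..N_p-2).
    smoothness = r_delta * sum((b - a) ** 2 for a, b in zip(u, u[1:]))
    return tracking + effort + smoothness


def input_feasible(u, u_min, u_max, s):
    """Check the box constraint and the step-to-step change limit S."""
    in_range = all(u_min <= uk <= u_max for uk in u)
    rate_ok = all(abs(b - a) <= s for a, b in zip(u, u[1:]))
    return in_range and rate_ok


if __name__ == "__main__":
    x = [(1.0, 2.0), (2.0, 2.0)]        # made-up predicted states
    x_ref = [(1.0, 1.0), (2.0, 3.0)]    # made-up references
    u = [1.0, 0.0]                      # made-up acceleration commands
    print(mpc_cost(x, x_ref, u, q_diag=(1.0, 2.0), r=0.5, r_delta=2.0))
    print(input_feasible(u, u_min=-3.0, u_max=2.0, s=1.0))
```

The third term penalizes abrupt changes in acceleration, which is the part of the cost most directly tied to producing smooth, human-like commands.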

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with constant velocity, a path-following model with traffic flow, and a constant turn rate and velocity (CTRV) model.

The algorithms are compared on four kinds of error: the <math>x</math> position error <math>e_{x,T_p}</math>, the <math>y</math> position error <math>e_{y,T_p}</math>, the heading error <math>e_{\theta,T_p}</math>, and the velocity error <math>e_{v,T_p}</math>, where <math>T_p</math> denotes the prediction time. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=& p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=& p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=& \theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=& v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean, standard deviation (STD), and root mean square error (RMSE). Moreover, the error distribution of the proposed model is bell-shaped with a mean close to zero, which indicates that its predictions of human drivers' intentions are relatively precise. In addition, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within reasonable levels. For instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, the proposed algorithm can be precise and maintain safety simultaneously.
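
The summary statistics used in this comparison can be computed as in the sketch below, which takes actual and predicted values of one quantity and returns the mean, STD, RMSE, and three-sigma range of the error. The function name <code>error_summary</code> and the sample numbers are made up for illustration and are not results from the paper.

```python
from math import sqrt
from statistics import mean, pstdev


def error_summary(actual, predicted):
    """Summarize e = actual - predicted for one quantity (e.g. y position)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mu = mean(errors)
    sigma = pstdev(errors)  # population standard deviation (an assumption)
    rmse = sqrt(mean([e * e for e in errors]))
    return {
        "mean": mu,
        "std": sigma,
        "rmse": rmse,
        "three_sigma": (mu - 3 * sigma, mu + 3 * sigma),
    }


if __name__ == "__main__":
    # Made-up lateral positions in metres, not data from the paper.
    actual = [1.0, 2.0, 3.0, 4.0]
    predicted = [0.0, 2.0, 4.0, 4.0]
    print(error_summary(actual, predicted))
```

Comparing the width of the three-sigma range with the lane width is the check the text describes for the <math>y</math> position error.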

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better, by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle, even when the target vehicle was not following the intersection guideline.

==== Statistical analysis of motion planning application results ====
The data are analyzed from two perspectives: the time to recognize the in-lane target and the similarity to human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base algorithm. The proposed algorithm recognized the target later than the base algorithm only when the surrounding target vehicles first appeared beyond the boundaries of the sensors’ region of interest. These cases took place sufficiently beyond the safety distance and therefore had little influence on the behaviour of the subject vehicle.

To compare the results from the proposed algorithm with human driving decisions, the authors introduced another type of error, the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math> and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm, respectively. The proposed algorithm's results were closer to human drivers’ decisions than those of the base algorithms: <math>91.97\%</math> of the acceleration errors lie within <math>\pm 1 m/s^2</math>. Moreover, the base algorithm has a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed model is efficient and safe.
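
A figure such as the share of acceleration errors within <math>\pm 1 m/s^2</math> can be obtained as in the following sketch. The function name <code>fraction_within</code> and the sample accelerations are hypothetical; the snippet only shows how such a percentage is computed and does not reproduce the paper's data.

```python
def fraction_within(a_human, a_cmd, bound=1.0):
    """Share of samples whose acceleration error lies within +/- bound (m/s^2)."""
    # Acceleration error as defined in the text: human value minus command.
    errors = [h - c for h, c in zip(a_human, a_cmd)]
    inside = sum(1 for e in errors if abs(e) <= bound)
    return inside / len(errors)


if __name__ == "__main__":
    # Made-up accelerations in m/s^2.
    a_human = [0.5, 1.0, -2.0, 0.0]
    a_cmd = [0.0, 2.5, -1.5, 0.2]
    print(fraction_within(a_human, a_cmd))
```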

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained on data captured on urban roads in Seoul, and its predictions were used in an MPC-based motion planner. The evaluation results showed precise prediction accuracy, so the algorithm is considered safe to apply on an autonomous vehicle. The comparison with the three base algorithms (CV/Path, V_flow/Path, and CTRV) also revealed the superiority of the proposed algorithm. In addition, the time to recognize in-lane targets within the intersection improved significantly over the base algorithms. The proposed algorithm was compared with human driving data and showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unstructured environments, such as multi-lane turn intersections.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention on motorways and main roads in urban environments.

3. Extending the target roads of the trajectory predictor to, for example, roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive in real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTMs, RNNs, and studies on different types of roads. The reason for using an LSTM-RNN and the background of the method are not stated clearly. The concepts are not clearly separated, so it is difficult to distinguish between the LSTM-RNN-based motion predictor and the motion planning.

This is an interesting topic to discuss, and a major one for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers a complete view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. It would also be interesting to discuss further applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare its performance with that of other algorithms, including traditional motion planning algorithms such as the Kalman filter (KF).

Some papers have discussed the accuracy of different models for vehicle prediction, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions [5] [https://arxiv.org/pdf/1908.00219.pdf]. There, the LSTM did not perform well. The authors increased accuracy by combining the LSTM with an unconstrained model (UM), adding an additional LSTM layer of size 128 that recursively outputs positions instead of outputting positions for all horizons simultaneously.

It would be better to provide experimental results to support the effectiveness of the LSTM-RNN, discuss its predictions on the training and test sets, and compare it with other existing autonomous driving systems.

The topic of surround vehicle motion prediction is closely related to the topic of autonomous vehicles. One application of these frameworks is the transportation services industry: many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.

It would be really helpful if some visualization or data summary were provided to aid understanding of the content, such as the track of the car's movement.

The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

[5] H. Cui, T. Nguyen, F.-C. Chou, T.-H. Lin, J. Schneider, D. Bradley, and N. Djuric, “Deep kinematic models for kinematically feasible vehicle trajectory predictions,” 2019, [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].

[6] J. Schulz, C. Hubmann, N. Morin, J. Löchner, and D. Burschka, “Learning interaction-aware probabilistic driver behavior models from urban scenarios,” 2019, doi: 10.1109/IVS.2019.8814080.

Surround Vehicle Motion Prediction

2020-12-01T00:21:49Z

Y87yu:

'''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. It describes a data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to handle complex vehicle motions in multi-lane turn intersections. The results show that the predictor shortens the time needed to recognize the leading vehicle and improves prediction accuracy.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. Motion prediction models for driver behavior on urban roads fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, and consider only the kinematic states of the predicted vehicles. Their advantage is that they have the lowest computational burden of the three types. However, they cannot account for interactions between vehicles. Maneuver-based models consider the driver’s intention and classify it. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is reported to improve prediction accuracy. However, these models still serve mainly as an aid to physics-based models. The Recurrent Neural Network (RNN) is the approach proposed in this paper to infer driver intention. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, their prediction algorithms are computationally more complex, so they are often used only in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create, as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].

== Motivation ==
Comparatively little research has focused on predicting vehicle trajectories at intersections. Moreover, public data sets for analyzing driver behavior at intersections are scarce and difficult to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. Together these form the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm, which relies on six scanners, to estimate the state of surrounding vehicles. The predicted state of the surrounding vehicles is then used to determine the expected longitudinal acceleration in actual traffic at the intersection. The following image gives a visual representation of the model.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
Multi-lane turn intersections are the target roads in this paper. The dataset was captured on urban roads in Seoul. The model was trained on 484 tracks collected while driving through intersections in real traffic. From each track, the previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, comprising 11,662 training samples and 4,998 evaluation samples.

=== Motion predictor ===
This paper proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The LSTM-RNN-based motion predictor in this work uses only information collected from sensors on the autonomous vehicle, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
An RNN is an artificial neural network suited to sequential data. It can also be used for time-series data, where the pattern of the data depends on the flow of time. It can contain feedback loops that allow activations to flow around the loop.
An LSTM avoids the vanishing gradient problem by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from growing or decaying over time, which would otherwise make the network train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, formed by combining four input sets with 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
To apply the motion predictor to an AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of the relative X/Y position, relative heading angle, and speed of the surrounding target vehicles, together with the speed of the data collection vehicle. The output sequence contains the same kinds of target-vehicle features: relative position, heading, and speed.

==== Encoder and decoder ====
In this study, the authors introduced an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data to mean 0 and standard deviation 1, while the decoder denormalizes the output data, using the same parameters as the encoder, to scale it back to physical units.
==== Sequence length ====
The sequence length of the RNN input and output is another important factor in prediction performance. In this study, sequences of 5, 10, 15, 20, 25, and 30 steps with a 100-millisecond sampling time were compared, and 15 steps gave relatively accurate results even though the corresponding observation time is very short.

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers predict possible risks based on observations of surrounding vehicles and ensure safety by changing their behavior before the risks occur. To achieve a human-like motion plan, a prediction-based motion planner for autonomous vehicles is designed using the model predictive control (MPC) method, which takes into account the predicted future behavior of surrounding vehicles. The cost function of the motion planner is defined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q (x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta u}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta u}</math> are the weight matrices for states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles.
The constraints on the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&||u(k+1|t) - u(k|t)|| \leq S
\end{split}
\end{equation*}
The position and speed boundaries are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = \min(v_{x,ref}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where <math>v_{x, limit}</math> is the speed limit of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with constant velocity, a path-following model with traffic flow, and a constant turn rate and velocity (CTRV) model.

The algorithms are compared on four kinds of error: the <math>x</math> position error <math>e_{x,T_p}</math>, the <math>y</math> position error <math>e_{y,T_p}</math>, the heading error <math>e_{\theta,T_p}</math>, and the velocity error <math>e_{v,T_p}</math>, where <math>T_p</math> denotes the prediction time. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=& p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=& p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=& \theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=& v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean, standard deviation (STD), and root mean square error (RMSE). Moreover, the error distribution of the proposed model is bell-shaped with a mean close to zero, which indicates that its predictions of human drivers' intentions are relatively precise. In addition, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within reasonable levels. For instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, the proposed algorithm can be precise and maintain safety simultaneously.

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better, by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle, even when the target vehicle was not following the intersection guideline.

==== Statistical analysis of motion planning application results ====
The data are analyzed from two perspectives: the time to recognize the in-lane target and the similarity to human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base algorithm. The proposed algorithm recognized the target later than the base algorithm only when the surrounding target vehicles first appeared beyond the boundaries of the sensors’ region of interest. These cases took place sufficiently beyond the safety distance and therefore had little influence on the behaviour of the subject vehicle.

To compare the results from the proposed algorithm with human driving decisions, the authors introduced another type of error, the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math> and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm, respectively. The proposed algorithm's results were closer to human drivers’ decisions than those of the base algorithms: <math>91.97\%</math> of the acceleration errors lie within <math>\pm 1 m/s^2</math>. Moreover, the base algorithm has a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed model is efficient and safe.

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained using data captured on urban roads in Seoul, and its predictions were used in an MPC-based motion planner. The evaluation showed accurate predictions, which suggests that the algorithm can be applied safely on an autonomous vehicle. The comparison with three base algorithms (CV/Path, V_flow/Path, and CTRV) also showed the superiority of the proposed algorithm.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention on motorways and main roads in urban environments.

3. Extending the target road of the trajectory predictor, for example to roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive in real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTMs, RNNs, and studies on different types of roads. Why the LSTM-RNN is used and the background of the method are not stated clearly. Key concepts are not introduced, so it is difficult to distinguish between the LSTM-RNN-based motion predictor and motion planning.

This is an interesting topic to discuss. It is a major topic for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers a complete view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. It would also be interesting to discuss more applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare the performance with other algorithms, including traditional motion planning algorithms such as the Kalman filter (KF).

There are some papers that discuss the accuracy of different models for vehicle prediction, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions [https://arxiv.org/pdf/1908.00219.pdf]. In that work, the LSTM did not show good performance. The authors increased the accuracy by combining the LSTM with an unconstrained model (UM), adding an additional LSTM layer of size 128 that recursively outputs positions instead of simultaneously outputting positions for all horizons.

It may be better to provide experimental results to support the efficiency of the LSTM-RNN, discuss the predictions on the training and test sets, and compare it with other existing autonomous driving systems.

The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.

It would be really helpful if some visualization or data summary can be provided to understand the content, such as the track of the car movement.

The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].

[6] J. Schulz, C. Hubmann, N. Morin, J. Löchner, and D. Burschka, “Learning interaction-aware probabilistic driver behavior models from urban scenarios,” 2019, doi: 10.1109/IVS.2019.8814080.

stat441F21

2020-11-30T23:54:07Z

Y87yu: /* Paper presentation */

Surround Vehicle Motion Prediction

2020-11-29T23:40:40Z

Y87yu: /* Squence length */

DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. To predict driver behavior on an urban road, motion prediction models fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, and consider only the kinematic states of the predicted vehicles. Their advantage is that they have the smallest computational burden of the three types. However, they cannot account for the interactions between vehicles. Maneuver-based models consider the driver’s intention and classify it. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is stated to improve the prediction accuracy. However, this approach still only assists physics-based models. A Recurrent Neural Network (RNN) is the type of approach proposed in this paper to infer driver intention. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, their prediction algorithms are computationally more complex, so they are often used only in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].

== Motivation ==
Previous studies indicate that little research has focused on predicting vehicle trajectories at intersections. Moreover, public datasets for analyzing driver behavior at intersections are scarce, and such data is not easy to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The figure below depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm, which relies on six scanners, to estimate the state of surrounding vehicles. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The dataset was captured on urban roads in Seoul. The training data is generated from 484 tracks collected while driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted from each track. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training samples and 4,998 evaluation samples.

=== Motion predictor ===
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
An RNN is an artificial neural network suited to sequential data. It can also be used for time-series data, where the pattern of the data depends on the flow of time. It can contain feedback loops that allow activations to flow around the loop.
An LSTM avoids the vanishing gradient problem by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from growing or decaying over time, either of which can make the network train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, which combine four input sets with 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading and speed.
==== Encoder and decoder ====
In this study, the authors introduced an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data to mean 0 and standard deviation 1, while the decoder denormalizes the output data, using the same parameters as the encoder, to scale it back to the actual units.
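
The normalization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name <code>Standardizer</code>, the three-feature layout, and the sample values are all assumptions made for the example.

```python
from statistics import mean, pstdev


class Standardizer:
    """Hypothetical encoder/decoder pair that z-scores each feature column."""

    def fit(self, rows):
        cols = list(zip(*rows))
        self.means = [mean(c) for c in cols]
        # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
        self.stds = [pstdev(c) or 1.0 for c in cols]
        return self

    def encode(self, row):
        # Rescale every feature to mean 0 and standard deviation 1.
        return [(v - m) / s for v, m, s in zip(row, self.means, self.stds)]

    def decode(self, row):
        # Invert the scaling with the same parameters to recover physical units.
        return [v * s + m for v, m, s in zip(row, self.means, self.stds)]


# Made-up samples: (relative x [m], relative y [m], speed [m/s]).
samples = [(10.0, 1.0, 5.0), (20.0, 3.0, 7.0), (30.0, 5.0, 9.0)]
scaler = Standardizer().fit(samples)
encoded = scaler.encode(samples[0])
decoded = scaler.decode(encoded)
```

Because the decoder reuses the statistics stored by the encoder, decoding an encoded sample returns the original values up to floating-point error.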
==== Sequence length ====
The sequence length of the RNN input and output is another important factor for prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps with a 100 millisecond sampling time were compared, and 15 steps showed relatively accurate results, even though the corresponding observation time is very short.
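
To make the role of the sequence length concrete, the sketch below cuts one recorded track into pairs of observed and future steps. It is only an illustration: the function name <code>make_windows</code> is hypothetical, and using the same length (15 steps) for input and output is an assumption, since the summary does not state how the authors built their samples.

```python
def make_windows(track, n_in=15, n_out=15):
    """Split one track into (observed, future) pairs of consecutive steps."""
    pairs = []
    for start in range(len(track) - n_in - n_out + 1):
        observed = track[start:start + n_in]
        future = track[start + n_in:start + n_in + n_out]
        pairs.append((observed, future))
    return pairs


SAMPLE_TIME_S = 0.1          # 100 millisecond sampling time, from the text
track = list(range(40))      # stand-in for 40 recorded states of one vehicle
pairs = make_windows(track)
observation_time_s = 15 * SAMPLE_TIME_S
```

With 15 steps at a 100 millisecond sampling time, each observed window covers 1.5 seconds of motion.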

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q (x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively. These weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&|u(k+1|t) - u(k|t)| \leq S
\end{split}
\end{equation*}
The position and speed bounds are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where <math>v_{x, limit}</math> is the speed limit of the target vehicle.
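
As a worked illustration of the cost function and input constraints above, the sketch below evaluates them for a given candidate trajectory. It does not solve the MPC problem and is not the authors' controller. The function names are hypothetical, the control input is written as <code>u</code> throughout, the input and input-rate weights are treated as scalars, and all numbers are made up.

```python
def mpc_cost(states, refs, inputs, Q, R, R_delta):
    """Evaluate J for one candidate plan.

    states, refs: x(1|t)..x(N_p|t) and their references, each [p_x, v_x].
    inputs: u(0|t)..u(N_p-1|t), longitudinal acceleration commands.
    """
    tracking = 0.0
    for x, x_ref in zip(states, refs):
        e = [a - b for a, b in zip(x, x_ref)]
        # Quadratic form e^T Q e for the state tracking error.
        tracking += sum(e[i] * Q[i][j] * e[j]
                        for i in range(len(e)) for j in range(len(e)))
    effort = R * sum(u * u for u in inputs)
    # Penalize changes between consecutive commands, k = 0..N_p-2.
    smoothness = R_delta * sum((b - a) ** 2 for a, b in zip(inputs, inputs[1:]))
    return tracking + effort + smoothness


def inputs_feasible(inputs, u_min, u_max, max_step):
    """Check the magnitude and rate constraints on the control input."""
    in_range = all(u_min <= u <= u_max for u in inputs)
    smooth = all(abs(b - a) <= max_step for a, b in zip(inputs, inputs[1:]))
    return in_range and smooth


# Toy plan with N_p = 2 (assumed values, chosen for easy hand checking).
states = [[1.0, 2.0], [2.0, 2.0]]
refs = [[1.0, 1.0], [2.0, 3.0]]
inputs = [0.5, -0.5]
Q = [[1.0, 0.0], [0.0, 2.0]]
J = mpc_cost(states, refs, inputs, Q, R=1.0, R_delta=0.1)
```

For this toy plan the tracking term is 4.0, the effort term is 0.5, and the smoothness term is 0.1, so <code>J</code> is 4.6.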

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with
constant velocity, a path-following model with traffic flow, and a CTRV model.
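
For readers unfamiliar with the last baseline, CTRV usually stands for "constant turn rate and velocity". The sketch below shows one prediction step of the standard textbook form of that model. It is not taken from the paper, and the function name and argument names are assumptions.

```python
import math


def ctrv_step(x, y, heading, speed, yaw_rate, dt):
    """Advance a pose by dt, holding speed and yaw rate constant."""
    if abs(yaw_rate) < 1e-9:
        # Zero turn rate degenerates to straight-line motion.
        return (x + speed * math.cos(heading) * dt,
                y + speed * math.sin(heading) * dt,
                heading)
    new_heading = heading + yaw_rate * dt
    radius = speed / yaw_rate
    # Move along a circular arc of the given radius.
    new_x = x + radius * (math.sin(new_heading) - math.sin(heading))
    new_y = y + radius * (math.cos(heading) - math.cos(new_heading))
    return new_x, new_y, new_heading


# A quarter turn to the left at 1 m/s (made-up values).
turn = ctrv_step(0.0, 0.0, 0.0, 1.0, math.pi / 2, 1.0)
straight = ctrv_step(0.0, 0.0, 0.0, 2.0, 0.0, 1.5)
```

Such a physics-based baseline extrapolates the current motion only, which is why it cannot anticipate a driver's change of intention.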

We compare those algorithms according to four types of error: the <math>x</math> position error <math>e_{x,T_p}</math>, the
<math>y</math> position error <math>e_{y,T_p}</math>, the heading error <math>e_{\theta,T_p}</math>, and the velocity error <math>e_{v,T_p}</math>, where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=& p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=& p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=& \theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=& v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}

The proposed model shows significantly smaller prediction errors compared to the base algorithms in terms of mean,
standard deviation (STD), and root mean square error (RMSE). Meanwhile, the error distribution of the proposed model is a bell-shaped
curve with a close-to-zero mean, which indicates that the proposed algorithm's predictions of human drivers'
intentions are relatively precise. In addition, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.
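
The three summary statistics used in this comparison can be computed as in the sketch below. The function name and the sample values are assumptions for illustration, and the population form of the standard deviation is used, which the summary does not specify.

```python
import math


def error_stats(actual, predicted):
    """Return (mean, STD, RMSE) of the errors actual - predicted."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mean = sum(errors) / n
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mean, std, rmse


# Made-up lateral positions [m] at the prediction time.
actual_y = [1.0, 2.0, 3.0, 4.0]
predicted_y = [0.5, 2.5, 2.5, 4.5]
mean_e, std_e, rmse_e = error_stats(actual_y, predicted_y)
three_sigma = 3 * std_e
```

With this definition the RMSE satisfies RMSE² = mean² + STD², so a close-to-zero mean makes the STD and the RMSE nearly equal.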

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better, by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle, even when the target vehicle was not following the intersection guideline.

==== Statistical analysis of motion planning application results ====
The data is analyzed from two perspectives: the time to recognize the in-lane target and the similarity to
human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base
algorithm. The proposed algorithm recognized the target later than the base algorithm only when
the surrounding target vehicles first appeared beyond the sensors’ region-of-interest boundaries. This means
that these cases took place sufficiently beyond the safety distance and had little influence on determining
the behaviour of the subject vehicle.

To compare the results from the proposed algorithm with human driving decisions,
we introduce another type of error, the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm produced commands closer to human drivers’ decisions than the base
algorithms did. <math>91.97\%</math> of the acceleration error lies within <math>\pm 1 \, m/s^2</math>. Moreover, the base algorithm
has a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed
model is efficient and safe.
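
The share of acceleration errors inside a band, as reported above, can be computed as in the following sketch. The function name and the sample accelerations are made up for illustration and are unrelated to the paper's data.

```python
def fraction_within(human_accel, command_accel, bound=1.0):
    """Share of samples whose acceleration error lies within +/- bound (m/s^2)."""
    errors = [h - c for h, c in zip(human_accel, command_accel)]
    inside = sum(1 for e in errors if abs(e) <= bound)
    return inside / len(errors)


# Made-up human accelerations and planner commands [m/s^2].
human = [0.2, -1.5, 0.8, 2.0]
command = [0.0, -0.2, 0.5, 0.5]
share = fraction_within(human, command)
```

In this toy data two of the four errors fall inside the band, so <code>share</code> is 0.5.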

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained using data captured on urban roads in Seoul, and its predictions were used in an MPC-based motion planner. The evaluation showed accurate predictions, which suggests that the algorithm can be applied safely on an autonomous vehicle. The comparison with three base algorithms (CV/Path, V_flow/Path, and CTRV) also showed the superiority of the proposed algorithm.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention on motorways and main roads in urban environments.

3. Extending the target road of the trajectory predictor, for example to roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive in real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTMs, RNNs, and studies on different types of roads. Why the LSTM-RNN is used and the background of the method are not stated clearly. Key concepts are not introduced, so it is difficult to distinguish between the LSTM-RNN-based motion predictor and motion planning.

This is an interesting topic to discuss. It is a major topic for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers a complete view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. It would also be interesting to discuss more applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare the performance with other algorithms, including traditional motion planning algorithms such as the Kalman filter (KF).

There are some papers that discuss the accuracy of different models for vehicle prediction, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions [https://arxiv.org/pdf/1908.00219.pdf]. In that work, the LSTM did not show good performance. The authors increased the accuracy by combining the LSTM with an unconstrained model (UM), adding an additional LSTM layer of size 128 that recursively outputs positions instead of simultaneously outputting positions for all horizons.

It may be better to provide experimental results to support the efficiency of the LSTM-RNN, discuss the predictions on the training and test sets, and compare it with other existing autonomous driving systems.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].

[6] J. Schulz, C. Hubmann, N. Morin, J. Löchner, and D. Burschka, “Learning interaction-aware probabilistic driver behavior models from urban scenarios,” 2019, doi: 10.1109/IVS.2019.8814080.

Surround Vehicle Motion Prediction

2020-11-28T10:31:28Z

Y87yu: /* motion planning application */

DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. To predict driver behavior on an urban road, motion prediction models fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, and consider only the kinematic states of the predicted vehicles. Their advantage is that they have the smallest computational burden of the three types. However, they cannot account for the interactions between vehicles. Maneuver-based models consider the driver’s intention and classify it. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is stated to improve the prediction accuracy. However, this approach still only assists physics-based models. A Recurrent Neural Network (RNN) is the type of approach proposed in this paper to infer driver intention. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, their prediction algorithms are computationally more complex, so they are often used only in offline simulations.

== Motivation ==
Previous studies indicate that little research has focused on predicting vehicle trajectories at intersections. Moreover, public datasets for analyzing driver behavior at intersections are scarce, and such data is not easy to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The figure below depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm, which relies on six scanners, to estimate the state of surrounding vehicles. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples and 4,998 evaluation data samples.

=== Motion predictor ===
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
An RNN is an artificial neural network suited to sequential data. It can also be used for time-series data, where the pattern of the data depends on the flow of time. It can contain feedback loops that allow activations to flow around the loop.
An LSTM avoids the vanishing gradient problem by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from growing or decaying over time, either of which can make the network train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, which combine four input sets with 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading and speed.
==== Encoder and decoder ====
In this study, the authors introduced an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data to mean 0 and standard deviation 1, while the decoder denormalizes the output data, using the same parameters as the encoder, to scale it back to the actual units.
==== Sequence length ====
The sequence length of the RNN input and output is another important factor for prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps with a 100 millisecond sampling time were compared, and 15 steps showed relatively accurate results, even though the corresponding observation time is very short.

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q (x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively. These weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&|u(k+1|t) - u(k|t)| \leq S
\end{split}
\end{equation*}
The position and speed bounds are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where <math>v_{x, limit}</math> is the speed limit of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with
constant velocity, a path-following model with traffic flow, and a CTRV model.

We compare those algorithms according to four types of error: the <math>x</math> position error <math>e_{x,T_p}</math>, the
<math>y</math> position error <math>e_{y,T_p}</math>, the heading error <math>e_{\theta,T_p}</math>, and the velocity error <math>e_{v,T_p}</math>, where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=& p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=& p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=& \theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=& v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean,
standard deviation (STD), and root mean square error (RMSE). Moreover, the error distribution of the proposed model is a bell-shaped
curve with a close-to-zero mean, which indicates that the proposed algorithm's predictions of human drivers'
intentions are relatively precise. In addition, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.
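The summary statistics used in this comparison can be computed as in the sketch below. It assumes NumPy is available and that the actual and predicted values are given as equal-length arrays; the function name `error_summary` and the use of the population standard deviation are illustrative choices, not details taken from the paper.

```python
import numpy as np


def error_summary(actual, predicted):
    """Mean, STD, RMSE and three-sigma range of the error e = actual - predicted."""
    e = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    mean = float(e.mean())
    std = float(e.std())                    # population standard deviation (ddof=0)
    rmse = float(np.sqrt(np.mean(e ** 2)))
    return {
        "mean": mean,
        "std": std,
        "rmse": rmse,
        "three_sigma": (mean - 3.0 * std, mean + 3.0 * std),
    }
```

With this definition, the three quantities are linked by RMSE² = mean² + STD², so a close-to-zero mean makes the RMSE and the STD nearly equal.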

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics human drivers better by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle even when the target vehicle was not following the intersection guide line.

==== Statistical analysis of motion planning application results ====
The data is analyzed from two perspectives: the time to recognize the in-lane target and the similarity to
human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base
algorithm. In addition, the proposed algorithm recognized the target later than the base algorithm only when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance, and had little influence on determining
the behavior of the subject vehicle.

To compare the results from the proposed algorithm with human driving decisions,
we introduce another type of error, the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base
algorithms: <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1\ m/s^2</math>. Moreover, the base algorithms
have limited ability to respond to different in-lane target behaviors in traffic flow. Hence, the proposed
model is efficient and safe.
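The share of samples whose acceleration error falls inside a tolerance band can be computed as in the following sketch. The function name `share_within` and the sample values in the usage note are made up for illustration and are not data from the paper.

```python
import numpy as np


def share_within(a_human, a_cmd, tol=1.0):
    """Fraction of samples with |a_human - a_cmd| <= tol (tol in m/s^2)."""
    err = np.asarray(a_human, dtype=float) - np.asarray(a_cmd, dtype=float)
    return float(np.mean(np.abs(err) <= tol))
```

For instance, `share_within([0.0, 1.0, -2.0, 0.5], [0.2, 2.5, -1.5, 0.5])` gives 0.75, because three of the four errors lie within ±1 m/s².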

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained on data captured on urban roads in Seoul, and its predictions were used in an MPC-based motion planner. The evaluation results showed precise prediction accuracy, so the algorithm is considered safe to apply on an autonomous vehicle. Also, the comparison with the three base algorithms (CV/Path, V_flow/Path and CTRV) revealed the superiority of the proposed algorithm.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3. Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and studies on different types of road. Why the LSTM-RNN is used and the background of the method are not stated clearly. The concepts are not well delineated, so it is difficult to distinguish between the LSTM-RNN-based motion predictor and the motion planner.

This is an interesting topic to discuss. It is a major topic for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers a complete view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. I think it would be interesting to discuss more applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare the performance with different algorithms or with some other traditional algorithms such as KF.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

Surround Vehicle Motion Prediction

2020-11-28T10:30:12Z

Y87yu: /* accuracy analysis */

DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. Motion prediction models for driver behavior on urban roads fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, and consider only the kinematic states of the predicted vehicles. Their advantage is that they have the lowest computational burden of the three types. However, they cannot consider interactions between vehicles. Maneuver-based models consider the driver’s intention and classify it. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is stated to improve the prediction accuracy. However, this approach still serves only as an aid to physics-based models. The Recurrent Neural Network (RNN) is the type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, their prediction algorithms are more computationally complex, so they are often used in offline simulations.

== Motivation ==
Previous research indicates that little work has focused on predicting vehicle trajectories at intersections. Moreover, there are not enough public data sets for analyzing driver behavior at intersections, and such data sets are not easy to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The figure below depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples and 4,998 evaluation data samples.

=== Motion predictor ===
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
An RNN is an artificial neural network suitable for sequential data. It can also be used for time series data, where the pattern of the data depends on the flow of time. It can contain feedback loops that allow activations to flow repeatedly through the loop.
LSTM avoids the vanishing gradient problem by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which would otherwise make the network train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, which consist of combinations of four input sets and 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>
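To make the recurrence concrete, here is a minimal NumPy sketch of a single step of a standard LSTM cell. It is a generic textbook formulation, not the network from the paper: the gate ordering, the weight layout, and the sizes used in the usage note are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    x      : input vector, shape (D,)
    h_prev : previous hidden state, shape (H,)
    c_prev : previous cell state, shape (H,)
    W, U, b: stacked parameters with shapes (4H, D), (4H, H), (4H,),
             ordered as input gate, forget gate, output gate, candidate.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])           # input gate
    f = sigmoid(z[H:2 * H])       # forget gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:4 * H])   # candidate cell update
    c = f * c_prev + i * g        # additive cell-state path
    h = o * np.tanh(c)
    return h, c
```

A call such as `lstm_step(np.ones(5), np.zeros(3), np.ones(3), W, U, b)` would process one time step with five input features and a hypothetical hidden size of three. The additive update of `c` is what lets error signals pass through many time steps without being repeatedly squashed.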

==== Input and output features ====
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of the relative X/Y position, relative heading angle, and speed of the surrounding target vehicles, together with the speed of the data collection vehicle. The output sequence contains the same target features as the input sequence, namely relative position, heading, and speed.
==== Encoder and decoder ====
In this study, the authors introduced an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data to mean 0 and standard deviation 1, while the decoder denormalizes the output data with the same parameters as the encoder to scale it back to the actual units.
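The encoder and decoder described here amount to per-feature standardization and its inverse. The sketch below shows one way to write them with NumPy; the function names are hypothetical, and it assumes every feature has a non-zero standard deviation in the training data.

```python
import numpy as np


def fit_scaler(train_data):
    """Per-feature mean and standard deviation of an (N, F) training array."""
    data = np.asarray(train_data, dtype=float)
    return data.mean(axis=0), data.std(axis=0)


def encode(x, mean, std):
    """Rescale features to mean 0 and standard deviation 1."""
    return (np.asarray(x, dtype=float) - mean) / std


def decode(z, mean, std):
    """Map normalized values back to the original units."""
    return np.asarray(z, dtype=float) * std + mean
```

The same `mean` and `std` are passed to both functions, which is what guarantees that decoding an encoded value returns it in its original units.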
==== Sequence length ====
The sequence length of the RNN input and output is another important factor in prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps with a 100-millisecond sampling time were compared, and 15 steps showed relatively accurate results even though the observation time is very short.
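With a 100-millisecond sampling time, 15 steps correspond to 1.5 seconds of data. The following sketch shows one way a recorded track could be cut into input and output sequences of a given length. It assumes the input and output use the same number of steps and that the output window directly follows the input window; the function name `make_windows` is hypothetical.

```python
import numpy as np


def make_windows(track, seq_len=15):
    """Split a (T, F) track into (input, target) pairs of seq_len steps each.

    Each input window covers steps i .. i+seq_len-1 and its target window
    covers the following seq_len steps.
    """
    track = np.asarray(track, dtype=float)
    n = track.shape[0] - 2 * seq_len + 1
    if n <= 0:
        empty = np.empty((0, seq_len, track.shape[1]))
        return empty, empty.copy()
    inputs = np.stack([track[i:i + seq_len] for i in range(n)])
    targets = np.stack([track[i + seq_len:i + 2 * seq_len] for i in range(n)])
    return inputs, targets
```

A track of 40 samples, for example, yields 11 input/target pairs when `seq_len` is 15.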

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q(x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles.
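For a given candidate trajectory, the cost can be evaluated directly, as in the sketch below. It treats the input weights as scalars, stores the states for steps 1 to <math>N_p</math> row by row, and uses hypothetical names (`mpc_cost`, `R_du`); it only evaluates the cost and does not solve the optimization that an MPC planner performs at every time step.

```python
import numpy as np


def mpc_cost(x, x_ref, u, Q, R, R_du):
    """Evaluate the quadratic MPC cost for one candidate trajectory.

    x, x_ref : arrays of shape (Np, 2) holding [travel distance, velocity]
               for prediction steps k = 1 .. Np
    u        : array of shape (Np,) holding the acceleration commands
               for steps k = 0 .. Np-1
    Q        : (2, 2) state weight matrix
    R, R_du  : scalar weights on the input and on its step-to-step change
    """
    dx = np.asarray(x, dtype=float) - np.asarray(x_ref, dtype=float)
    u = np.asarray(u, dtype=float)
    state_cost = np.einsum("ki,ij,kj->", dx, Q, dx)  # sum of dx_k^T Q dx_k
    input_cost = R * np.sum(u ** 2)
    rate_cost = R_du * np.sum(np.diff(u) ** 2)       # u(k+1) - u(k), k = 0 .. Np-2
    return float(state_cost + input_cost + rate_cost)
```

With two prediction steps, tracking errors of 1 and 2 in travel distance, an identity `Q`, commands `[1, 3]`, `R = 0.5`, and `R_du = 2`, the three terms are 5, 5, and 8, so the cost is 18.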
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&||u(k+1|t) - u(k|t)|| \leq S
\end{split}
\end{equation*}
The position and speed bounds are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = \min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where <math>v_{x, limit}</math> is the speed limit of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with
constant velocity, a path-following model with traffic flow, and a constant turn rate and velocity (CTRV) model.

We compare the algorithms on four types of error: the <math>x</math> position error <math>e_{x,T_p}</math>,
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math>, where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=& p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=& p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=& \theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=& v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean,
standard deviation (STD), and root mean square error (RMSE). Moreover, the error distribution of the proposed model is a bell-shaped
curve with a close-to-zero mean, which indicates that the proposed algorithm's predictions of human drivers'
intentions are relatively precise. In addition, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics human drivers better by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle even when the target vehicle was not following the intersection guide line.

==== Statistical analysis of motion planning application results ====
The data is analyzed from two perspectives: the time to recognize the in-lane target and the similarity to
human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base
algorithm. In addition, the proposed algorithm recognized the target later than the base algorithm only when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance, and had little influence on determining
the behavior of the subject vehicle.

To compare the results from the proposed algorithm with human driving decisions,
we introduce another type of error, the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base
algorithms: <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1\ m/s^2</math>. Moreover, the base algorithms
have limited ability to respond to different in-lane target behaviors in traffic flow. Hence, the proposed
model is efficient and safe.

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained on data captured on urban roads in Seoul, and its predictions were used in an MPC-based motion planner. The evaluation results showed precise prediction accuracy, so the algorithm is considered safe to apply on an autonomous vehicle. Also, the comparison with the three base algorithms (CV/Path, V_flow/Path and CTRV) revealed the superiority of the proposed algorithm.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3. Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and studies on different types of road. Why the LSTM-RNN is used and the background of the method are not stated clearly. The concepts are not well delineated, so it is difficult to distinguish between the LSTM-RNN-based motion predictor and the motion planner.

This is an interesting topic to discuss. It is a major topic for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers a complete view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. I think it would be interesting to discuss more applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare the performance with different algorithms or with some other traditional algorithms such as KF.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

Surround Vehicle Motion Prediction

2020-11-28T10:27:45Z

Y87yu: /* Motion predictor */

DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. Motion prediction models for driver behavior on urban roads fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, and consider only the kinematic states of the predicted vehicles. Their advantage is that they have the lowest computational burden of the three types. However, they cannot consider interactions between vehicles. Maneuver-based models consider the driver’s intention and classify it. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is stated to improve the prediction accuracy. However, this approach still serves only as an aid to physics-based models. The Recurrent Neural Network (RNN) is the type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, their prediction algorithms are more computationally complex, so they are often used in offline simulations.

== Motivation ==
Previous research indicates that little work has focused on predicting vehicle trajectories at intersections. Moreover, there are not enough public data sets for analyzing driver behavior at intersections, and such data sets are not easy to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The figure below depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples and 4,998 evaluation data samples.

=== Motion predictor ===
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
An RNN is an artificial neural network suitable for sequential data. It can also be used for time series data, where the pattern of the data depends on the flow of time. It can contain feedback loops that allow activations to flow repeatedly through the loop.
LSTM avoids the vanishing gradient problem by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which would otherwise make the network train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, which consist of combinations of four input sets and 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of the relative X/Y position, relative heading angle, and speed of the surrounding target vehicles, together with the speed of the data collection vehicle. The output sequence contains the same target features as the input sequence, namely relative position, heading, and speed.
==== Encoder and decoder ====
In this study, the authors introduced an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data to mean 0 and standard deviation 1, while the decoder denormalizes the output data with the same parameters as the encoder to scale it back to the actual units.
==== Sequence length ====
The sequence length of the RNN input and output is another important factor in prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps with a 100-millisecond sampling time were compared, and 15 steps showed relatively accurate results even though the observation time is very short.

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q(x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&||u(k+1|t) - u(k|t)|| \leq S
\end{split}
\end{equation*}
The position and speed bounds are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = \min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where <math>v_{x, limit}</math> is the speed limit of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with
constant velocity, a path-following model with traffic flow, and a constant turn rate and velocity (CTRV) model.
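For reference, a constant turn rate and velocity (CTRV) baseline propagates a vehicle along a circular arc by holding its current speed and yaw rate fixed. The sketch below is a generic textbook version of that model, not the implementation used in the paper; the function name, the state layout, and the default of 15 steps at 0.1 s are assumptions made for illustration.

```python
import math


def ctrv_predict(x, y, heading, speed, yaw_rate, dt=0.1, steps=15):
    """Propagate a CTRV model and return a list of (x, y, heading) tuples.

    Speed and yaw rate are held constant over the whole horizon.
    """
    trajectory = []
    for _ in range(steps):
        if abs(yaw_rate) > 1e-6:
            # Motion along a circular arc of radius speed / yaw_rate.
            new_heading = heading + yaw_rate * dt
            x += speed / yaw_rate * (math.sin(new_heading) - math.sin(heading))
            y += speed / yaw_rate * (math.cos(heading) - math.cos(new_heading))
            heading = new_heading
        else:
            # Straight-line limit when the yaw rate is (almost) zero.
            x += speed * math.cos(heading) * dt
            y += speed * math.sin(heading) * dt
        trajectory.append((x, y, heading))
    return trajectory
```

Because the model has no notion of lanes or of other vehicles, its prediction departs from the true path as soon as a driver changes speed or steering during a turn, which is one reason such models serve as baselines.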

We compare the algorithms on four types of error: the <math>x</math> position error <math>e_{x,T_p}</math>,
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math>, where <math>T_p</math>
denotes time <math>p</math>. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=& p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=& p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=& \theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=& v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean,
standard deviation (STD), and root mean square error (RMSE). Moreover, the error distribution of the proposed model is a bell-shaped
curve with a close-to-zero mean, which indicates that the proposed algorithm's predictions of human drivers'
intentions are relatively precise. In addition, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics human drivers better by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle even when the target vehicle was not following the intersection guide line.

==== Statistical analysis of motion planning application results ====
The data is analyzed from two perspectives: the time to recognize the in-lane target and the similarity to
human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base
algorithm. In addition, the proposed algorithm recognized the target later than the base algorithm only when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance, and had little influence on determining
the behavior of the subject vehicle.

To compare the results from the proposed algorithm with human driving decisions,
we introduce another type of error, the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base
algorithms: <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1\ m/s^2</math>. Moreover, the base algorithms
have limited ability to respond to different in-lane target behaviors in traffic flow. Hence, the proposed
model is efficient and safe.

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained on data captured on urban roads in Seoul, and its predictions were used in an MPC-based motion planner. The evaluation results showed precise predictions, which suggests that the algorithm is safe to apply on an autonomous vehicle. In addition, the comparison with three base algorithms (CV/Path, V_flow/Path and CTRV) showed the superiority of the proposed algorithm.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3. Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and studies on different types of roads. The reason for using an LSTM-RNN and the background of the method are not stated clearly. The concepts are not introduced well enough, so it is difficult to distinguish between the LSTM-RNN-based motion predictor and the motion planner.

This is an interesting topic to discuss. It is a major topic for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction features. This summary could include more diagrams of the model architecture to give readers an overall view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. I think it would be interesting to discuss further applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare its performance with other algorithms, including traditional methods such as the Kalman filter (KF).

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

Surround Vehicle Motion Prediction

2020-11-28T10:27:18Z

Y87yu: /* LSTM-RNN based motion predictor */

Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. Motion prediction models for driver behavior on urban roads fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, and consider only the kinematic states of the predicted vehicles. Their advantage is that they have the lowest computational burden of the three types. However, they cannot account for interactions between vehicles. Maneuver-based models consider the driver’s intention and classify it. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is stated to improve prediction accuracy. However, this approach still serves only as an aid to physics-based models. A Recurrent Neural Network (RNN) is the type of approach proposed in this paper to infer driver intention. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, their prediction algorithms are more computationally complex, so they are often used only in offline simulations.

== Motivation ==
Prior work indicates that little research has focused on predicting vehicle trajectories at intersections. Moreover, public datasets for analyzing driver behavior at intersections are scarce and difficult to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The figure below depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm, which relies on six scanners, to estimate the states of surrounding vehicles. The predictor outputs the predicted states of the surrounding vehicles, which are used to determine the desired longitudinal acceleration in actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The real dataset was captured on urban roads in Seoul. The model is trained on 484 tracks collected while driving through intersections in real traffic. From each track, the previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples and 4,998 evaluation data samples.

=== Motion predictor ===
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work uses only information collected from sensors on the autonomous vehicle, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
An RNN is an artificial neural network suited to sequential data. It can also be used for time series data, where the pattern of the data depends on the flow of time. It can contain feedback loops that allow activations to flow around the loop.
An LSTM avoids the vanishing gradient problem by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from growing or decaying over time, which would otherwise cause the network to train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, formed by combining four input sets with 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>
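To make the role of the LSTM cell state concrete, the following is a minimal sketch of one step of a standard LSTM cell with scalar input and state. It is the textbook formulation, not the network from the paper; the function name <code>lstm_step</code> and the parameter dictionary are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a standard scalar LSTM cell (illustrative only).

    p is a hypothetical dict of weights: w* act on the input x,
    u* act on the previous hidden state, b* are biases.
    """
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    # Additive cell-state update: this path is what lets error signals
    # travel across many time steps without shrinking at every step.
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# With all weights zero, every gate equals 0.5 and the candidate is 0.
zero_params = {k: 0.0 for k in
               ["wi", "ui", "bi", "wf", "uf", "bf",
                "wo", "uo", "bo", "wg", "ug", "bg"]}
h, c = lstm_step(1.0, 0.0, 2.0, zero_params)
```

A real predictor would use vector-valued states and learned weights; the scalar form only shows how the gates combine.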

==== Input and output features ====
To apply the motion predictor to a moving AV, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of the relative X/Y position, relative heading angle, and speed of the surrounding target vehicles, together with the speed of the data collection vehicle. The output sequence contains the same target-vehicle features as the input: relative position, heading, and speed.

==== Encoder and decoder ====
The authors introduce an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data to mean 0 and standard deviation 1, while the decoder denormalizes the output data, using the same parameters as the encoder, to scale it back to actual units.

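The normalization described above can be sketched for a single feature as follows. This is only an illustration: the helper names and the sample speeds are made up, and the population standard deviation is an assumption, since the summary does not say which estimator is used.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Return (mean, std) of one feature, computed on the training data."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        sigma = 1.0  # guard against constant features
    return mu, sigma


def encode(values, mu, sigma):
    """Encoder: rescale to mean 0 and standard deviation 1."""
    return [(v - mu) / sigma for v in values]


def decode(values, mu, sigma):
    """Decoder: map network outputs back to physical units."""
    return [v * sigma + mu for v in values]


speeds = [8.0, 10.0, 12.0, 14.0]  # hypothetical target speeds in m/s
mu, sigma = fit_scaler(speeds)
z = encode(speeds, mu, sigma)
restored = decode(z, mu, sigma)
```

Storing the encoder parameters and reusing them in the decoder is what makes the round trip exact.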
==== Sequence length ====
The sequence length of the RNN input and output is another important factor in improving prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps with a 100 millisecond sampling time were compared, and 15 steps showed relatively accurate results even though the observation time is very short.
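As a rough illustration of what a sequence length of 15 steps means for the training data, the sketch below cuts one track into pairs of an observed window and a future window. The function name is hypothetical, and using the same length for input and output is an assumption made here for simplicity.

```python
def make_windows(track, n_in=15, n_out=15):
    """Split a track (one entry per 100 ms sample) into (past, future) pairs."""
    pairs = []
    for start in range(len(track) - n_in - n_out + 1):
        past = track[start:start + n_in]
        future = track[start + n_in:start + n_in + n_out]
        pairs.append((past, future))
    return pairs


# A dummy 4-second track: 40 samples at 100 ms, labelled by sample index.
track = list(range(40))
pairs = make_windows(track)
```

With 15 steps at 100 ms, each observed window covers 1.5 seconds of motion.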

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers predict possible risks based on observations of surrounding vehicles, and ensure safety by changing their behavior before the risks occur. To achieve a human-like motion plan, a prediction-based motion planner for autonomous vehicles is designed based on the model predictive control (MPC) method, taking into account the predicted future behavior of surrounding vehicles. The cost function of the motion planner is defined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q (x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles.
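The three terms of the cost can be evaluated numerically as in the sketch below. It assumes a diagonal state weight (one weight for travel distance, one for velocity) and made-up numbers; the paper's tuned weights are not given in this summary.

```python
def mpc_cost(states, refs, inputs, q_diag, r, r_delta):
    """Evaluate the MPC cost J for one candidate plan.

    states, refs: N_p pairs (p_x, v_x) for k = 1..N_p
    inputs:       N_p acceleration commands u(0|t)..u(N_p-1|t)
    q_diag:       (weight on p_x, weight on v_x), a diagonal Q (assumption)
    """
    # Tracking term: (x - x_ref)^T Q (x - x_ref) with diagonal Q.
    tracking = sum(
        q_diag[0] * (x[0] - xr[0]) ** 2 + q_diag[1] * (x[1] - xr[1]) ** 2
        for x, xr in zip(states, refs)
    )
    # Input magnitude term, k = 0..N_p-1.
    effort = r * sum(u * u for u in inputs)
    # Input change term, k = 0..N_p-2.
    smoothness = r_delta * sum(
        (inputs[k + 1] - inputs[k]) ** 2 for k in range(len(inputs) - 1)
    )
    return tracking + effort + smoothness


# Hypothetical two-step horizon.
J = mpc_cost(
    states=[(1.0, 2.0), (2.0, 2.0)],
    refs=[(1.0, 1.0), (3.0, 2.0)],
    inputs=[1.0, -1.0],
    q_diag=(1.0, 2.0),
    r=0.5,
    r_delta=0.25,
)
```

In this example the tracking term contributes 3, the input term 1, and the input change term 1, so J is 5.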
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
& u_{min} \leq u(k|t) \leq u_{max} \\
& \left| u(k+1|t) - u(k|t) \right| \leq S
\end{split}
\end{equation*}
The position and speed bounds are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t), \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = \min(v_{x,ret}(k|t), v_{x,limit}), \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where <math>v_{x, limit}</math> is the speed limit of the target vehicle.
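The input constraints and the state bounds above can be checked for a candidate plan with a few lines of code. This is a sketch under assumptions: the function names and numbers are invented, <code>c_des</code> is read here as a desired clearance to the target, and <code>v_ret</code> simply mirrors the symbol in the equation above.

```python
def inputs_feasible(inputs, u_min, u_max, s):
    """Check the magnitude limits and the step-to-step rate limit S."""
    within_range = all(u_min <= u <= u_max for u in inputs)
    within_rate = all(abs(b - a) <= s for a, b in zip(inputs, inputs[1:]))
    return within_range and within_rate


def state_bounds(p_tar, c_des, v_ret, v_limit):
    """Position and speed bounds for one prediction step."""
    return {
        "p_max": p_tar - c_des,      # stay c_des behind the predicted target
        "p_min": 0.0,
        "v_max": min(v_ret, v_limit),
        "v_min": 0.0,
    }


# Hypothetical values in SI units.
ok = inputs_feasible([0.0, 0.5, 1.0], u_min=-3.0, u_max=2.0, s=0.5)
too_abrupt = inputs_feasible([0.0, 1.0], u_min=-3.0, u_max=2.0, s=0.5)
bounds = state_bounds(p_tar=30.0, c_des=8.0, v_ret=15.0, v_limit=13.9)
```

Because the upper position bound follows the predicted target position, a better prediction of the target directly changes the space the planner may use.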

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with
constant velocity, a path-following model with traffic flow, and a CTRV model.
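For readers unfamiliar with the CTRV baseline, the sketch below shows one prediction step of the standard constant turn rate and velocity model. It is the textbook form, not necessarily the exact implementation used for comparison in the paper, and the function name is an assumption.

```python
import math


def ctrv_step(x, y, theta, v, omega, dt):
    """Propagate a vehicle state by dt with constant speed v and yaw rate omega."""
    if abs(omega) < 1e-9:
        # Straight-line limit when the yaw rate is (almost) zero.
        return (x + v * math.cos(theta) * dt,
                y + v * math.sin(theta) * dt,
                theta)
    theta_new = theta + omega * dt
    x_new = x + v / omega * (math.sin(theta_new) - math.sin(theta))
    y_new = y + v / omega * (math.cos(theta) - math.cos(theta_new))
    return x_new, y_new, theta_new


# Straight driving at 10 m/s for 1 s.
straight = ctrv_step(0.0, 0.0, 0.0, 10.0, 0.0, 1.0)
# Quarter turn to the left at 1 m/s.
turn = ctrv_step(0.0, 0.0, 0.0, 1.0, math.pi / 2, 1.0)
```

Such a model extrapolates the current motion along a circular arc and does not use the observed motion history, which is one way it differs from a learned predictor.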

We compare the algorithms using four types of error: the $x$ position error $e_{x,T_p}$, the
$y$ position error $e_{y,T_p}$, the heading error $e_{\theta,T_p}$, and the velocity error $e_{v,T_p}$, where $T_p$
denotes time $p$. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p} =& p_{x,T_p} - \hat{p}_{x,T_p}\\
e_{y,T_p} =& p_{y,T_p} - \hat{p}_{y,T_p}\\
e_{\theta,T_p} =& \theta_{T_p} - \hat{\theta}_{T_p}\\
e_{v,T_p} =& v_{T_p} - \hat{v}_{T_p}
\end{split}
\end{equation*}

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean,
standard deviation (STD), and root mean square error (RMSE). Moreover, the error distribution of the proposed model
is a bell-shaped curve with a mean close to zero, which indicates that the proposed algorithm's prediction of human
drivers' intentions is relatively precise. In addition, $e_{x,T_p}$, $e_{y,T_p}$, and $e_{v,T_p}$ are bounded within
reasonable levels. For instance, the three-sigma range of $e_{y,T_p}$ is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.
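The three summary statistics used in this comparison can be computed from any of the four error series as sketched below. The data are made up, and the population form of the standard deviation is an assumption.

```python
import math


def error_stats(actual, predicted):
    """Return (mean, STD, RMSE) of the errors e = actual - predicted."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mean = sum(errors) / n
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mean, std, rmse


# Hypothetical lateral positions in metres.
mean, std, rmse = error_stats([1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 4.0, 4.0])
three_sigma = (mean - 3 * std, mean + 3 * std)
```

With this definition the squared RMSE equals the squared mean plus the variance, so a near-zero mean and a small STD together imply a small RMSE.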

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics human drivers more closely by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle even when the target vehicle did not follow the intersection guide line.

==== Statistical analysis of motion planning application results ====
The data are analysed from two perspectives: the time to recognize the in-lane target and the similarity to
human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base
algorithm. In addition, the proposed algorithm recognized the target later than the base algorithm only when
the surrounding target vehicles first appeared beyond the boundaries of the sensors’ region of interest. This means
that these cases took place sufficiently beyond the safety distance and had little influence on determining
the behavior of the subject vehicle.

To compare the results from the proposed algorithm with human driving decisions,
we introduce another type of error, the acceleration error $a_{x, error} = a_{x, human} - a_{x, cmd}$, where $a_{x, human}$
and $a_{x, cmd}$ are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm showed results more similar to human drivers’ decisions than the base
algorithms did. $91.97\%$ of the acceleration errors lie within $\pm 1\ \mathrm{m/s^2}$. Moreover, the base algorithms
have limited ability to respond to different in-lane target behaviors in traffic flow. Hence, the proposed
model is efficient and safe.
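A figure such as the share of acceleration errors within a tolerance band can be computed as in the sketch below. The function name and the sample values are hypothetical and are not data from the paper.

```python
def fraction_within(human, command, tol=1.0):
    """Share of samples whose acceleration error |a_human - a_cmd| is <= tol."""
    errors = [h - c for h, c in zip(human, command)]
    return sum(1 for e in errors if abs(e) <= tol) / len(errors)


# Made-up accelerations in m/s^2.
a_human = [0.5, -1.0, 2.0, 0.0]
a_cmd = [0.0, 0.5, 0.5, 0.2]
share = fraction_within(a_human, a_cmd)
```

Here the errors are 0.5, -1.5, 1.5 and -0.2, so half of the samples fall inside the band.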

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained on data captured on urban roads in Seoul, and its predictions were used in an MPC-based motion planner. The evaluation results showed precise predictions, which suggests that the algorithm is safe to apply on an autonomous vehicle. In addition, the comparison with three base algorithms (CV/Path, V_flow/Path and CTRV) showed the superiority of the proposed algorithm.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3. Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and studies on different types of roads. The reason for using an LSTM-RNN and the background of the method are not stated clearly. The concepts are not introduced well enough, so it is difficult to distinguish between the LSTM-RNN-based motion predictor and the motion planner.

This is an interesting topic to discuss. It is a major topic for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction features. This summary could include more diagrams of the model architecture to give readers an overall view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. I think it would be interesting to discuss further applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare its performance with other algorithms, including traditional methods such as the Kalman filter (KF).

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

Surround Vehicle Motion Prediction

2020-11-28T10:26:54Z

Y87yu: /* Motion predictor */

DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability

== Previous Work ==
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on urban road, there are 3 categories for motion prediction model: (1) physics-based (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which often used in offline simulations.

== Motivation ==
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples and 4,998 evaluation data samples.

=== Motion predictor ===
This article propose a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture: ==== RNN is an artificial neural network, suitable for use with sequential data. RNN can also be used for time series data, where the pattern of the data depends on the time flow. It can contain feedback loops that allow activations to flow alternately in the loop.
LSTM can avoid the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network training improperly.The figure below shows the various layers of LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ==== In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading and speed.
==== Encoder and decoder: ==== In this study, authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit.
==== Squence length: ==== The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\
&||\mu(k+1|t) - \mu(k|t)|| \leq S
\end{split}
\end{equation*}
Determine the position and speed boundary based on the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== accuracy analysis ===
The proposed algorithm was compared with the results from three base algorithms, a path-following model with
constant velocity, a path-following model with traffic flow and a CTRV model.

We compare those algorithms according to four sorts of errors, The $x$ position error $e_{x,T_p}$,
$y$ position error $e_{y,T_p}$, heading error $e_{\theta,T_p}$, and velocity error $e_{v,T_p}$ where $T_p$
denotes time $p$. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}
\end{split}
\end{equation*}

The proposed model shows a significantly less prediction errors compare to the based algorithms in terms of mean,
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell shaped
cure with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers'
intensions are relatively precise. On the other hand, $e_{x,T_p}$, $e_{y,T_p}$, $e_{v,T_p}$ are bounded within
reasonable levels. For instant, the three-sigma range of $e_{y,T_p}$ is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.

=== motion planning application ===
==== case study of a multi-lane left turn scenario ====
The proposed method mimic a human drivers better by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle even the target vehicle was not following the intersection guide line.

==== statistical analysis of motion planning application results ====
The data is analysed from two perspectives, the time ot recognize the in-lane target and the similarity to
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance, and had little influence on determining
the behavior of the subject vehicle.

In order to compare the similarities between the results form the proposed algorithm and human driving decisions,
we introduced another type of error, acceleration error $a_{x, error} = a_{x, human} - a_{x, cmd}$. where $a_{x, human}$
and $a_{x, cmd}$ are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base
algorithms. $91.97\%$ of the acceleration error lies in the region $\pm 1 m/s^2$. Moreover, base algorithm
possesses limited ability to respond to different in-lane target behaviors in traffic flow. Hence, the proposed
model is efficient and safety.

== Conclusion ==
A surrounding vehicle motion predictor based on a LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with other three base algorithms (CV/Path, V_flow/Path and CTRV) revealed the superiority of the proposed algorithm.

== Future works ==
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and studies on different types of roads. The reason for using the LSTM-RNN and the background of the method are not stated clearly. Key concepts are not introduced, so it is difficult to distinguish between the LSTM-RNN based motion predictor and motion planning.

This is an interesting topic to discuss. It is a major topic for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers a full view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. It would also be interesting to discuss more applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare the performance with different algorithms or with some other traditional algorithms such as the Kalman filter (KF).

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

Surround Vehicle Motion Prediction

2020-11-28T10:26:23Z

Y87yu: /* motion predictor */

DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to better prediction ability.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. Motion prediction models for driver behavior on urban roads fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, and consider only the kinematic states of the predicted vehicles. Their advantage is that they have the lowest computational burden of the three types. However, they cannot account for interactions between vehicles. Maneuver-based models consider the driver’s intention and classify it. By predicting the driver’s maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is reported to improve prediction accuracy. However, this approach still mainly serves as an aid to physics-based models. A Recurrent Neural Network (RNN) is the type of approach proposed in this paper to infer driver intention. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, their prediction algorithms are computationally more complex, so they are often used in offline simulations.

== Motivation ==
Little research has focused on predicting vehicle trajectories at intersections. Moreover, public data sets for analyzing driver behavior at intersections are scarce, and such data sets are not easy to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The figure below depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm, which relies on six scanners, to estimate the state of surrounding vehicles. The output is the predicted state of the surrounding vehicles, which is used to determine the desired longitudinal acceleration in actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples and 4,998 evaluation data samples.

=== Motion predictor ===
This paper proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture uses only information collected from sensors on the autonomous vehicle, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
An RNN is an artificial neural network suited to sequential data. It can also be used for time series data, where the pattern of the data depends on the flow of time. It can contain feedback loops that allow activations to flow around the loop.
An LSTM avoids the vanishing gradient problem by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from growing or decaying over time, which would otherwise make the network train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, which consist of combinations of four input sets and 18 network configurations.
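To make the role of the LSTM cell state more concrete, the following is a minimal sketch of one step of a single-unit LSTM cell in plain Python. It is not the network used in the paper: the function name `lstm_step`, the weight dictionary `w`, and the scalar (one hidden unit) setting are all assumptions chosen for illustration. The point is the additive update of the cell state `c`, which is what lets information and error signals persist across many time steps.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM cell (illustrative only).

    w maps each gate name ('f', 'i', 'o', 'g') to a hypothetical
    (input weight, recurrent weight, bias) triple.
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("f"))      # forget gate: how much of c_prev to keep
    i = sigmoid(pre("i"))      # input gate: how much new content to write
    o = sigmoid(pre("o"))      # output gate: how much of the cell to expose
    g = math.tanh(pre("g"))    # candidate cell content
    c = f * c_prev + i * g     # additive cell-state update
    h = o * math.tanh(c)       # hidden state passed to the next step
    return h, c
```

When the forget gate stays close to 1 and the input gate close to 0, the cell state is carried through almost unchanged, however many steps are unrolled.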

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
To apply the motion predictor to a moving AV, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of the relative X/Y position, relative heading angle, and speed of the surrounding target vehicles, together with the speed of the data collection vehicle. The output sequence uses the same kinds of features as the input sequence, such as relative position, heading, and speed.
==== Encoder and decoder ====
The authors introduce an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data, rescaling it to mean 0 and standard deviation 1, while the decoder denormalizes the output data, using the same parameters as the encoder to scale it back to actual units.
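The normalization performed by the encoder and its inversion by the decoder can be sketched as follows. This is only an illustration under assumptions: the class name `FeatureScaler` is hypothetical, the statistics are taken over the training rows, and the population standard deviation is used, which the summary does not specify.

```python
from statistics import fmean, pstdev


class FeatureScaler:
    """Z-score scaler: fit on training rows, then encode and decode."""

    def fit(self, rows):
        cols = list(zip(*rows))  # one tuple per feature column
        self.mean = [fmean(c) for c in cols]
        # Fall back to 1.0 for constant features to avoid dividing by zero.
        self.std = [pstdev(c) or 1.0 for c in cols]
        return self

    def encode(self, row):
        # Rescale each component to mean 0 and standard deviation 1.
        return [(v - m) / s for v, m, s in zip(row, self.mean, self.std)]

    def decode(self, row):
        # Invert encode() with the same parameters to recover actual units.
        return [v * s + m for v, m, s in zip(row, self.mean, self.std)]
```

Because the decoder reuses the encoder's mean and standard deviation, `decode(encode(row))` returns the original row up to rounding.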
==== Sequence length ====
The sequence length of the RNN input and output is another important factor in prediction performance. In this study, sequences of 5, 10, 15, 20, 25, and 30 steps with a 100 millisecond sampling time were compared, and 15 steps (1.5 s) showed relatively accurate results despite the short observation time.
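As a rough illustration of how fixed-length sequences could be cut from a recorded track, the sketch below slides a window over the samples of one track. The function name `make_windows` and the choice of equal input and output lengths are assumptions; the summary does not describe the authors' actual preprocessing.

```python
def make_windows(track, n_in=15, n_out=15):
    """Split one track into (past, future) pairs of consecutive steps.

    track is a list of per-step feature rows sampled every 100 ms, so
    the default of 15 steps corresponds to 1.5 s of observation.
    """
    pairs = []
    for start in range(len(track) - n_in - n_out + 1):
        past = track[start:start + n_in]
        future = track[start + n_in:start + n_in + n_out]
        pairs.append((past, future))
    return pairs
```

A track shorter than `n_in + n_out` steps yields no pairs, so very short tracks are skipped.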

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers predict possible risks based on observations of surrounding vehicles and ensure safety by changing their behavior before the risks occur. To achieve a human-like motion plan, a prediction-based motion planner for autonomous vehicles is designed using the model predictive control (MPC) method; it takes into account drivers’ future behavior. The cost function of the motion planner is defined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q(x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles.
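The cost function can be evaluated for a candidate input sequence as in the sketch below. It assumes a diagonal <math>Q</math>, passed as `q_diag`, and scalar weights `r` and `r_delta`; the summary does not give the structure or values of the weights, so these and the function name `mpc_cost` are illustrative choices only.

```python
def mpc_cost(x, x_ref, u, q_diag, r, r_delta):
    """Evaluate the MPC cost J for one candidate plan.

    x, x_ref: lists of (p_x, v_x) states for k = 1..N_p
    u:        list of acceleration commands for k = 0..N_p-1
    q_diag:   diagonal entries of Q (assumed diagonal here)
    """
    tracking = 0.0
    for xk, rk in zip(x, x_ref):
        err = [a - b for a, b in zip(xk, rk)]
        tracking += sum(q * e * e for q, e in zip(q_diag, err))
    effort = r * sum(uk * uk for uk in u)
    # Penalize changes between consecutive inputs, k = 0..N_p-2.
    smooth = r_delta * sum((b - a) ** 2 for a, b in zip(u, u[1:]))
    return tracking + effort + smooth
```

The three terms correspond to tracking the reference state, limiting the size of the acceleration command, and limiting how quickly it changes.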
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&||u(k+1|t) - u(k|t)|| \leq S
\end{split}
\end{equation*}
The position and speed bounds are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = \min(v_{x,ref}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where <math>v_{x, limit}</math> is the speed limit of the target vehicle.
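The input constraints and state bounds above can be checked with a few lines of code. In this sketch the function names are hypothetical, `c_des` is assumed to be a desired clearance to the target vehicle, and the velocity argument `v_ref` is assumed to be the reference velocity; neither is defined explicitly in the summary.

```python
def input_feasible(u, u_min, u_max, s):
    """Check the magnitude and rate limits on an input sequence u."""
    in_range = all(u_min <= uk <= u_max for uk in u)
    rate_ok = all(abs(b - a) <= s for a, b in zip(u, u[1:]))
    return in_range and rate_ok


def state_bounds(p_tar, c_des, v_ref, v_limit):
    """Return ((p_min, p_max), (v_min, v_max)) for one prediction step."""
    p_max = p_tar - c_des        # stay c_des behind the predicted target
    v_max = min(v_ref, v_limit)  # never exceed the speed limit
    return (0.0, p_max), (0.0, v_max)
```

For example, a target predicted at 30 m with a 5 m clearance gives an upper position bound of 25 m for that step.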

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with
constant velocity, a path-following model with traffic flow, and a constant turn rate and velocity (CTRV) model.
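For readers unfamiliar with the CTRV baseline, the sketch below shows the standard CTRV state update, in which the speed and yaw rate are held constant over the prediction interval. It is a textbook formulation, not code from the paper, and the function name `ctrv_step` is an assumption.

```python
import math


def ctrv_step(x, y, theta, v, omega, dt):
    """Advance a CTRV state (position, heading) by dt seconds."""
    if abs(omega) < 1e-9:
        # Straight-line limit: avoids dividing by a near-zero yaw rate.
        return (x + v * math.cos(theta) * dt,
                y + v * math.sin(theta) * dt,
                theta)
    theta_new = theta + omega * dt
    x_new = x + v / omega * (math.sin(theta_new) - math.sin(theta))
    y_new = y + v / omega * (math.cos(theta) - math.cos(theta_new))
    return x_new, y_new, theta_new
```

Because such a model extrapolates only the current kinematic state, it has no way to anticipate a change in the driver's intention during a turn.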

We compare these algorithms using four types of error: the $x$ position error $e_{x,T_p}$,
$y$ position error $e_{y,T_p}$, heading error $e_{\theta,T_p}$, and velocity error $e_{v,T_p}$, where $T_p$
denotes the prediction time. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=& p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=& p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=& \theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=& v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean,
standard deviation (STD), and root mean square error (RMSE). Moreover, the error distribution of the proposed model is a bell-shaped
curve with a mean close to zero, which indicates that the proposed algorithm's predictions of human drivers'
intentions are relatively precise. In addition, $e_{x,T_p}$, $e_{y,T_p}$, and $e_{v,T_p}$ are bounded within
reasonable levels. For instance, the three-sigma range of $e_{y,T_p}$ is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.
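The summary statistics mentioned above can be computed from paired actual and predicted values as sketched below. The function name `error_summary` is hypothetical, and the population standard deviation is an assumption, since the summary does not say which estimator the authors used.

```python
from math import sqrt
from statistics import fmean, pstdev


def error_summary(actual, predicted):
    """Mean, STD, RMSE and three-sigma range of the prediction error."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mean = fmean(errors)
    std = pstdev(errors)
    rmse = sqrt(fmean([e * e for e in errors]))
    return {
        "mean": mean,
        "std": std,
        "rmse": rmse,
        "three_sigma": (mean - 3 * std, mean + 3 * std),
    }
```

With the population standard deviation, the three quantities satisfy RMSE² = mean² + STD², so a mean close to zero makes the RMSE and STD nearly equal.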

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver more closely by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle even when the target vehicle was not following the intersection guide line.

==== Statistical analysis of motion planning application results ====
The data are analysed from two perspectives: the time to recognize the in-lane target and the similarity to
human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base
algorithms. In addition, the proposed algorithm recognized the target later than the base algorithm only in cases where
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance and had little influence on determining
the behavior of the subject vehicle.

To compare the results from the proposed algorithm with human driving decisions, we introduce another type of error,
the acceleration error $a_{x, error} = a_{x, human} - a_{x, cmd}$, where $a_{x, human}$
and $a_{x, cmd}$ are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm produced results closer to human drivers’ decisions than the base
algorithms did: $91.97\%$ of the acceleration error lies within $\pm 1 m/s^2$. Moreover, the base algorithms
have limited ability to respond to different in-lane target behaviors in traffic flow. Hence, the proposed
model is both efficient and safe.
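A figure such as the share of acceleration errors within $\pm 1 m/s^2$ can be obtained from logged data as sketched below. The function name `fraction_within` and the sample values in the usage note are made up for illustration and are not taken from the paper.

```python
def fraction_within(a_human, a_cmd, bound=1.0):
    """Fraction of samples whose acceleration error lies in [-bound, bound]."""
    errors = [h - c for h, c in zip(a_human, a_cmd)]
    if not errors:
        raise ValueError("no samples to compare")
    return sum(abs(e) <= bound for e in errors) / len(errors)
```

For instance, `fraction_within([0.0, 1.0, -2.0, 0.5], [0.2, -0.5, -0.5, 0.5])` compares four hypothetical samples, two of which fall inside the band.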

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained on data captured on urban roads in Seoul, and its predictions were used in an MPC-based motion planner. The evaluation showed accurate predictions, which suggests that the algorithm is safe to apply on an autonomous vehicle. In addition, a comparison with three base algorithms (CV/Path, V_flow/Path, and CTRV) showed that the proposed algorithm performs better.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3. Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and studies on different types of roads. The reason for using the LSTM-RNN and the background of the method are not stated clearly. Key concepts are not introduced, so it is difficult to distinguish between the LSTM-RNN based motion predictor and motion planning.

This is an interesting topic to discuss. It is a major topic for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers a full view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. It would also be interesting to discuss more applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare the performance with different algorithms or with some other traditional algorithms such as the Kalman filter (KF).

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

Surround Vehicle Motion Prediction

2020-11-28T10:24:20Z

Y87yu: /* Motion planning based on surrounding vehicle motion prediction */

DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to better prediction ability.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. Motion prediction models for driver behavior on urban roads fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, and consider only the kinematic states of the predicted vehicles. Their advantage is that they have the lowest computational burden of the three types. However, they cannot account for interactions between vehicles. Maneuver-based models consider the driver’s intention and classify it. By predicting the driver’s maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is reported to improve prediction accuracy. However, this approach still mainly serves as an aid to physics-based models. A Recurrent Neural Network (RNN) is the type of approach proposed in this paper to infer driver intention. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, their prediction algorithms are computationally more complex, so they are often used in offline simulations.

== Motivation ==
Little research has focused on predicting vehicle trajectories at intersections. Moreover, public data sets for analyzing driver behavior at intersections are scarce, and such data sets are not easy to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The figure below depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm, which relies on six scanners, to estimate the state of surrounding vehicles. The output is the predicted state of the surrounding vehicles, which is used to determine the desired longitudinal acceleration in actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples and 4,998 evaluation data samples.

=== Motion predictor ===
This paper proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture uses only information collected from sensors on the autonomous vehicle, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

\textbf{Network architecture:} An RNN is an artificial neural network suited to sequential data. It can also be used for time series data, where the pattern of the data depends on the flow of time. It can contain feedback loops that allow activations to flow around the loop.
An LSTM avoids the vanishing gradient problem by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from growing or decaying over time, which would otherwise make the network train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, which consist of combinations of four input sets and 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

\textbf{Input and output features:} To apply the motion predictor to a moving AV, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of the relative X/Y position, relative heading angle, and speed of the surrounding target vehicles, together with the speed of the data collection vehicle. The output sequence uses the same kinds of features as the input sequence, such as relative position, heading, and speed.
\textbf{Encoder and decoder:} The authors introduce an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data, rescaling it to mean 0 and standard deviation 1, while the decoder denormalizes the output data, using the same parameters as the encoder to scale it back to actual units.
\textbf{Sequence length:} The sequence length of the RNN input and output is another important factor in prediction performance. In this study, sequences of 5, 10, 15, 20, 25, and 30 steps with a 100 millisecond sampling time were compared, and 15 steps (1.5 s) showed relatively accurate results despite the short observation time.

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers predict possible risks based on observations of surrounding vehicles and ensure safety by changing their behavior before the risks occur. To achieve a human-like motion plan, a prediction-based motion planner for autonomous vehicles is designed using the model predictive control (MPC) method; it takes into account drivers’ future behavior. The cost function of the motion planner is defined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q(x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&||u(k+1|t) - u(k|t)|| \leq S
\end{split}
\end{equation*}
The position and speed bounds are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = \min(v_{x,ref}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where <math>v_{x, limit}</math> is the speed limit of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with
constant velocity, a path-following model with traffic flow, and a constant turn rate and velocity (CTRV) model.

We compare these algorithms using four types of error: the $x$ position error $e_{x,T_p}$,
$y$ position error $e_{y,T_p}$, heading error $e_{\theta,T_p}$, and velocity error $e_{v,T_p}$, where $T_p$
denotes the prediction time. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=& p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=& p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=& \theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=& v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean,
standard deviation (STD), and root mean square error (RMSE). Moreover, the error distribution of the proposed model is a bell-shaped
curve with a mean close to zero, which indicates that the proposed algorithm's predictions of human drivers'
intentions are relatively precise. In addition, $e_{x,T_p}$, $e_{y,T_p}$, and $e_{v,T_p}$ are bounded within
reasonable levels. For instance, the three-sigma range of $e_{y,T_p}$ is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver more closely by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle even when the target vehicle was not following the intersection guide line.

==== Statistical analysis of motion planning application results ====
The data are analysed from two perspectives: the time to recognize the in-lane target and the similarity to
human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base
algorithms. In addition, the proposed algorithm recognized the target later than the base algorithm only in cases where
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance and had little influence on determining
the behavior of the subject vehicle.

To compare the results from the proposed algorithm with human driving decisions, we introduce another type of error,
the acceleration error $a_{x, error} = a_{x, human} - a_{x, cmd}$, where $a_{x, human}$
and $a_{x, cmd}$ are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm produced results closer to human drivers’ decisions than the base
algorithms did: $91.97\%$ of the acceleration error lies within $\pm 1 m/s^2$. Moreover, the base algorithms
have limited ability to respond to different in-lane target behaviors in traffic flow. Hence, the proposed
model is both efficient and safe.

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained on data captured on urban roads in Seoul, and its predictions were used in an MPC-based motion planner. The evaluation showed accurate predictions, which suggests that the algorithm is safe to apply on an autonomous vehicle. In addition, a comparison with three base algorithms (CV/Path, V_flow/Path, and CTRV) showed that the proposed algorithm performs better.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3. Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and studies on different types of roads. The reason for using the LSTM-RNN and the background of the method are not stated clearly. Key concepts are not introduced, so it is difficult to distinguish between the LSTM-RNN based motion predictor and motion planning.

This is an interesting topic to discuss. It is a major topic for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers a full view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. It would also be interesting to discuss more applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare the performance with different algorithms or with some other traditional algorithms such as the Kalman filter (KF).

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

Surround Vehicle Motion Prediction

2020-11-28T10:23:59Z

Y87yu: /* Motion planning based on surrounding vehicle motion prediction */

Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to handle complex vehicle motions in multi-lane turn intersections. The results show that the predictor shortens the recognition time of the leading vehicle and improves prediction ability.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. Motion prediction models for driver behavior on urban roads fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware.

Physics-based models are simple and direct: they consider only the kinematic states of the predicted vehicles. Their advantage is that they have the lowest computational burden of the three types, but they cannot account for interactions between vehicles.

Maneuver-based models classify the driver’s intention and predict the future trajectory from the predicted maneuver. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is reported to improve prediction accuracy. However, such models still serve mainly as a supplement to physics-based models. An RNN is one type of approach proposed to infer driver intention in this paper.

Interaction-aware models reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, these prediction algorithms are computationally more complex and are therefore often used only in offline simulations.
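To make the physics-based category concrete, the following is a minimal sketch of two standard kinematic rollouts: a constant-velocity model and a constant turn rate and velocity (CTRV) model, the latter of which appears later in this summary as a base algorithm. The function names, the state layout and the time step are assumptions chosen for illustration; this is not the implementation used in the paper.

```python
import math


def predict_constant_velocity(x, y, heading, speed, dt, steps):
    """Straight-line rollout: heading and speed are held fixed."""
    path = []
    for _ in range(steps):
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        path.append((x, y, heading))
    return path


def predict_ctrv(x, y, heading, speed, yaw_rate, dt, steps):
    """Constant turn rate and velocity (CTRV) rollout along a circular arc."""
    path = []
    for _ in range(steps):
        if abs(yaw_rate) < 1e-9:
            # Degenerate case: no turning, so fall back to straight-line motion.
            x += speed * math.cos(heading) * dt
            y += speed * math.sin(heading) * dt
        else:
            new_heading = heading + yaw_rate * dt
            x += speed / yaw_rate * (math.sin(new_heading) - math.sin(heading))
            y += speed / yaw_rate * (math.cos(heading) - math.cos(new_heading))
            heading = new_heading
        path.append((x, y, heading))
    return path


# Hypothetical target: 10 m/s, heading along +x, 15 steps of 0.1 s.
straight = predict_constant_velocity(0.0, 0.0, 0.0, 10.0, 0.1, 15)
turning = predict_ctrv(0.0, 0.0, 0.0, 10.0, 0.3, 0.1, 15)
```

Both rollouts use only the current state of one vehicle, which is why such models are cheap to evaluate but cannot react to what neighbouring vehicles do.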

== Motivation ==
Relatively little research has focused on predicting vehicle trajectories at intersections. Moreover, public datasets for analyzing driver behavior at intersections are scarce, and such data are not easy to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The figure below depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm, which relies on six scanners, to estimate the state of surrounding vehicles. The predicted state of the surrounding vehicles is then used to determine the desired longitudinal acceleration in actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The dataset was captured on urban roads in Seoul. The model is trained on 484 tracks collected while driving through intersections in real traffic. From each track, the previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, comprising 11,662 training samples and 4,998 evaluation samples (a 70/30 split).

=== motion predictor ===
This paper proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work uses only information collected from sensors on the autonomous vehicle, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

'''Network architecture:''' An RNN is an artificial neural network suited to sequential data, including time series in which the pattern of the data depends on the flow of time. It contains feedback loops that allow activations from one time step to flow back into the network at the next step.
An LSTM avoids the vanishing gradient problem by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from growing or decaying over time, which would otherwise make the network train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, formed by combining four input sets with 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>
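As a rough illustration of the feedback loop and the gated cell state described above, here is a sketch of a single LSTM time step in its standard textbook form (input, forget and output gates plus a candidate update). It is written in plain Python for readability; the parameter layout, the gate names in the dictionary and the tiny dimensions are assumptions for this example and do not reflect the layer sizes of the network in the figure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W @ v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to (W, b) acting on [x, h_prev]."""
    z = x + h_prev  # list concatenation of the input and the previous hidden state
    f = [sigmoid(a) for a in affine(*params["forget"], z)]
    i = [sigmoid(a) for a in affine(*params["input"], z)]
    o = [sigmoid(a) for a in affine(*params["output"], z)]
    g = [math.tanh(a) for a in affine(*params["cand"], z)]
    # The cell state is updated additively, gated by f and i.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Hypothetical 1-input, 1-unit cell with the forget gate open and the input gate closed.
zero_w = [[0.0, 0.0]]
carry_params = {
    "forget": (zero_w, [50.0]),
    "input": (zero_w, [-50.0]),
    "output": (zero_w, [0.0]),
    "cand": (zero_w, [0.0]),
}
h, c = [0.0], [2.0]
for _ in range(100):
    h, c = lstm_step([1.0], h, c, carry_params)
```

With the forget gate saturated near 1 and the input gate near 0, the cell state is carried across all 100 steps essentially unchanged, which is the mechanism that lets information (and, during training, error signals) persist over long sequences.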

'''Input and output features:''' To apply the motion predictor to a moving AV, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of the relative X/Y position, relative heading angle, and speed of the surrounding target vehicle, together with the speed of the data collection vehicle. The output sequence contains the same target-vehicle quantities as the input: relative position, heading, and speed.

'''Encoder and decoder:''' The authors introduce an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data to mean 0 and standard deviation 1, while the decoder denormalizes the output data, using the same parameters as the encoder, to scale it back to actual units.

'''Sequence length:''' The sequence length of the RNN input and output is another important factor in prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps at a 100-millisecond sampling time were compared, and 15 steps showed relatively accurate results even though its observation time is short compared with the longer candidates.
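The encoder, decoder and fixed sequence length described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the statistics are taken over whatever training rows are passed in, and the output window is assumed to have the same length as the input window (15 steps), which the summary does not spell out.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Per-feature mean and standard deviation computed from training rows."""
    cols = list(zip(*rows))
    mu = [mean(c) for c in cols]
    sigma = [pstdev(c) or 1.0 for c in cols]  # guard against constant features
    return mu, sigma


def encode(row, mu, sigma):
    """Encoder: rescale each feature to zero mean and unit standard deviation."""
    return [(v - m) / s for v, m, s in zip(row, mu, sigma)]


def decode(row, mu, sigma):
    """Decoder: invert the encoder with the same parameters."""
    return [v * s + m for v, m, s in zip(row, mu, sigma)]


def make_windows(track, n_in=15, n_out=15):
    """Slice one track into (past, future) pairs of fixed length."""
    pairs = []
    for start in range(len(track) - n_in - n_out + 1):
        past = track[start:start + n_in]
        future = track[start + n_in:start + n_in + n_out]
        pairs.append((past, future))
    return pairs


# Hypothetical rows of (relative x, target speed); real rows would carry all five inputs.
rows = [[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]]
mu, sigma = fit_scaler(rows)
restored = decode(encode(rows[2], mu, sigma), mu, sigma)
pairs = make_windows(list(range(40)))
```

At a 100 ms sampling time, a 15-step window corresponds to 1.5 s of observed motion, so a track needs at least 30 samples to yield one training pair under these assumptions.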

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers predict possible risks from observations of surrounding vehicles and ensure safety by changing their behavior before the risks occur. To achieve a human-like motion plan, a prediction-based motion planner for autonomous vehicles is designed using model predictive control (MPC), taking into account the future behavior of surrounding drivers. The cost function of the motion planner is defined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q (x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta u}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta u}</math> are the weight matrices for states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles.
The constraints on the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&|u(k+1|t) - u(k|t)| \leq S
\end{split}
\end{equation*}
The position and speed bounds are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = \min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where $v_{x, limit}$ is the speed limit of the target vehicle.
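The cost function and the input constraints above can be evaluated for a candidate acceleration sequence as in the sketch below. It is not the paper's solver: it only scores a given sequence, it assumes a diagonal <math>Q</math> for brevity, and the function and parameter names (`mpc_cost`, `inputs_feasible`, `max_step` standing in for <math>S</math>) are hypothetical.

```python
def mpc_cost(states, refs, inputs, q_diag, r, r_delta):
    """Evaluate J for one candidate input sequence.

    states, refs: lists of (p_x, v_x) for k = 1..N_p
    inputs: acceleration commands u(k|t) for k = 0..N_p-1
    q_diag: (q_p, q_v), i.e. a diagonal Q (an assumption made for brevity)
    """
    tracking = sum(
        q_diag[0] * (p - p_ref) ** 2 + q_diag[1] * (v - v_ref) ** 2
        for (p, v), (p_ref, v_ref) in zip(states, refs)
    )
    effort = r * sum(u ** 2 for u in inputs)
    # Pairs (u(k), u(k+1)) for k = 0..N_p-2 penalise abrupt changes.
    smoothness = r_delta * sum((b - a) ** 2 for a, b in zip(inputs, inputs[1:]))
    return tracking + effort + smoothness


def inputs_feasible(inputs, u_min, u_max, max_step):
    """Check the magnitude and rate limits on the acceleration command."""
    in_range = all(u_min <= u <= u_max for u in inputs)
    smooth = all(abs(b - a) <= max_step for a, b in zip(inputs, inputs[1:]))
    return in_range and smooth


# Hypothetical two-step horizon.
states = [(1.0, 2.0), (2.0, 2.0)]
refs = [(1.0, 1.0), (2.0, 1.0)]
inputs = [1.0, 0.5]
cost = mpc_cost(states, refs, inputs, q_diag=(1.0, 2.0), r=0.1, r_delta=10.0)
ok = inputs_feasible(inputs, u_min=-3.0, u_max=2.0, max_step=0.5)
```

Raising `r_delta` relative to `r` favours smoother acceleration profiles, which is the kind of tuning the summary describes for matching human-driven behaviour.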

== Prediction performance analysis and application to motion planning ==
=== accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with
constant velocity, a path-following model with traffic flow, and a constant turn rate and velocity (CTRV) model.

The algorithms are compared on four kinds of error: the $x$ position error $e_{x,T_p}$, the
$y$ position error $e_{y,T_p}$, the heading error $e_{\theta,T_p}$, and the velocity error $e_{v,T_p}$, where $T_p$
denotes time $p$ and quantities with a hat are the predicted values. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=& p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=& p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=& \theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=& v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean,
standard deviation (STD), and root mean square error (RMSE). Moreover, the error distribution of the proposed model is a bell-shaped
curve with a mean close to zero, which indicates that the proposed algorithm's predictions of human drivers'
intentions are relatively precise. In addition, $e_{x,T_p}$, $e_{y,T_p}$, and $e_{v,T_p}$ are bounded within
reasonable levels. For instance, the three-sigma range of $e_{y,T_p}$ is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.
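The summary statistics used in this comparison can be computed as in the following sketch. The sample values and the half-width passed to the three-sigma check are made up for illustration; they are not figures from the paper.

```python
import math
from statistics import mean, pstdev


def error_stats(actual, predicted):
    """Mean, STD and RMSE of the error e = actual - predicted."""
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(mean([e * e for e in errors]))
    return {"mean": mean(errors), "std": pstdev(errors), "rmse": rmse}


def three_sigma_within(stats, half_width):
    """True if mean +/- 3*std stays inside +/- half_width."""
    return abs(stats["mean"]) + 3 * stats["std"] <= half_width


# Hypothetical lateral positions (m) at the prediction time.
actual_y = [1.0, 2.0, 3.0, 4.0]
predicted_y = [0.0, 2.0, 4.0, 4.0]
stats = error_stats(actual_y, predicted_y)
```

Because the RMSE satisfies RMSE² = mean² + STD² (with the population STD), a near-zero mean makes the RMSE and the STD almost equal, which is consistent with the bell-shaped, zero-centred error distribution reported above.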

=== motion planning application ===
==== case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle even when the target vehicle was not following the intersection guide line.

==== statistical analysis of motion planning application results ====
The data are analysed from two perspectives: the time to recognize the in-lane target and the similarity to
human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base
algorithm. The only cases that the proposed algorithm recognized later than the base algorithm were those in which
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. These
cases took place sufficiently beyond the safety distance and had little influence on determining
the behavior of the subject vehicle.

To compare the similarity between the results from the proposed algorithm and human driving decisions,
another type of error is introduced, the acceleration error $a_{x, error} = a_{x, human} - a_{x, cmd}$, where $a_{x, human}$
and $a_{x, cmd}$ are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm produced results more similar to human drivers’ decisions than the base
algorithms did. $91.97\%$ of the acceleration error lies within $\pm 1\ m/s^2$. Moreover, the base algorithms
have limited ability to respond to different in-lane target behaviors in traffic flow. Hence, the proposed
model is both efficient and safe.
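A share such as the one quoted above is simply the fraction of samples whose acceleration error falls inside a band. A minimal sketch, with made-up sample values and a hypothetical function name:

```python
def share_within(human_acc, cmd_acc, bound=1.0):
    """Fraction of samples with |a_human - a_cmd| <= bound (in m/s^2)."""
    errors = [h - c for h, c in zip(human_acc, cmd_acc)]
    return sum(abs(e) <= bound for e in errors) / len(errors)


# Hypothetical acceleration histories (m/s^2).
human = [0.5, -1.0, 2.0, 0.0]
command = [0.0, -0.2, 0.5, 0.1]
share = share_within(human, command)
```

In this toy example three of the four errors lie within the ±1 m/s² band.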

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained on data captured on urban roads in Seoul and applied within an MPC-based motion planner. The evaluation results showed precise predictions, suggesting that the algorithm is safe to apply on an autonomous vehicle. The comparison with three base algorithms (CV/Path, V_flow/Path, and CTRV) also showed the superiority of the proposed algorithm.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention on motorways and main roads in urban environments.

3. Extending the trajectory predictor to other road types, such as roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive in real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTMs, RNNs, and studies of different road types. Why the LSTM-RNN is used and the background of the method are not stated clearly. Key concepts are not explained, so it is difficult to distinguish between the LSTM-RNN-based motion predictor and the motion planner.

This is an interesting topic to discuss, and a major one for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers a complete view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. It would also be interesting to discuss further applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would be interesting to compare its performance with that of other algorithms, including traditional approaches such as the Kalman filter (KF).

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

Surround Vehicle Motion Prediction

2020-11-28T10:20:30Z

Y87yu: /* accuracy analysis */


Surround Vehicle Motion Prediction

2020-11-28T10:19:28Z

Y87yu: /* accuracy analysis */

Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to handle complex vehicle motions in multi-lane turn intersections. The results show that the predictor shortens the recognition time of the leading vehicle and improves prediction ability.

== Previous Work ==
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers share the road. Motion prediction models for driver behavior on urban roads fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware.

Physics-based models are simple and direct: they consider only the kinematic states of the predicted vehicles. Their advantage is that they have the lowest computational burden of the three types, but they cannot account for interactions between vehicles.

Maneuver-based models classify the driver’s intention and predict the future trajectory from the predicted maneuver. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is reported to improve prediction accuracy. However, such models still serve mainly as a supplement to physics-based models. An RNN is one type of approach proposed to infer driver intention in this paper.

Interaction-aware models reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, these prediction algorithms are computationally more complex and are therefore often used only in offline simulations.

== Motivation ==
Existing research pays relatively little attention to trajectory prediction at intersections. Moreover, public datasets for analyzing driver behavior at intersections are scarce, and such data are not easy to collect. A model is therefore needed to predict the various movements of targets around a multi-lane turn intersection, and the motion predictor must be usable in real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The figure below depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm, which relies on six scanners, to estimate the state of surrounding vehicles. The predictor outputs the future state of the surrounding vehicles, which is used to determine the desired longitudinal acceleration in actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The dataset was captured on urban roads in Seoul. The training data were generated from 484 tracks collected while driving through intersections in real traffic. From each track, the previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, comprising 11,662 training samples and 4,998 evaluation samples (a 70/30 split).
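
The extraction of previous and subsequent states from a track can be sketched as a sliding window. This is only an illustration under assumptions: the function name <code>make_samples</code>, the state layout, and the toy track are hypothetical, and the paper's actual post-processing is not described in this summary.

```python
from typing import List, Sequence, Tuple

# Assumed state layout: (relative x, relative y, relative heading, speed).
State = Tuple[float, float, float, float]


def make_samples(
    track: Sequence[State], n_in: int, n_out: int
) -> List[Tuple[List[State], List[State]]]:
    """Slide a window over one track and return (past, future) pairs."""
    samples = []
    last_start = len(track) - (n_in + n_out)
    for start in range(last_start + 1):
        past = list(track[start:start + n_in])
        future = list(track[start + n_in:start + n_in + n_out])
        samples.append((past, future))
    return samples


# Toy track of 40 states sampled every 100 ms (values are made up).
toy_track = [(0.1 * i, 0.0, 0.0, 5.0) for i in range(40)]
pairs = make_samples(toy_track, n_in=15, n_out=15)
print(len(pairs))  # 40 - 30 + 1 = 11 windows
```

A track shorter than <code>n_in + n_out</code> yields no samples, which is one simple way to discard tracks that are too short.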

=== motion predictor ===
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture uses only information collected from sensors on the autonomous vehicle, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

\textbf{Network architecture:} An RNN is an artificial neural network suited to sequential data, including time series in which the pattern of the data depends on the flow of time. It can contain feedback loops that allow activations to be passed from one time step to the next.
An LSTM avoids the vanishing gradient problem by letting errors flow backward through an unlimited number of virtual layers. This property keeps errors from growing or decaying over time, either of which can prevent the network from training properly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, formed by combining four input sets with 18 network configurations.
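
For reference, a commonly used formulation of a single LSTM cell update is given below. This is the standard textbook form, and whether the paper uses exactly this variant is an assumption. The symbols $x_t$ (input), $h_t$ (hidden state), $c_t$ (cell state), the weights $W$, $U$, and the biases $b$ are notation introduced here. The additive update of $c_t$ is what lets errors flow backward across many time steps.
\begin{equation*}
\begin{split}
f_t =& \sigma(W_f x_t + U_f h_{t-1} + b_f)\\
i_t =& \sigma(W_i x_t + U_i h_{t-1} + b_i)\\
o_t =& \sigma(W_o x_t + U_o h_{t-1} + b_o)\\
\tilde{c}_t =& \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
c_t =& f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t =& o_t \odot \tanh(c_t)
\end{split}
\end{equation*}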

<center>[[Image:Figure8_Yan.png|800px|]]</center>

\textbf{Input and output features:} To apply the motion predictor to a moving AV, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of the relative X/Y position, relative heading angle, and speed of the surrounding target vehicles, together with the speed of the data collection vehicle. The output sequence contains the same kinds of features as the input, namely relative position, heading, and speed.

\textbf{Encoder and decoder:} The authors introduce an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data to mean 0 and standard deviation 1, while the decoder denormalizes the output data, using the same parameters as the encoder, to scale it back to the actual units.

\textbf{Sequence length:} The sequence length of the RNN input and output is another important factor in prediction performance. In this study, sequences of 5, 10, 15, 20, 25, and 30 steps with a 100 millisecond sampling time were compared, and 15 steps showed relatively accurate results even though the corresponding observation time (1.5 s) is very short.
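
The encoder and decoder described above amount to a z-score transform and its inverse. A minimal sketch follows, assuming the statistics are computed once per feature on the training data. The function names and the sample speeds are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_scaler(values: List[float]) -> Tuple[float, float]:
    """Return (mean, std) of one input feature, computed on training data."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0.0:
        sigma = 1.0  # guard for constant features (a safeguard added here)
    return mu, sigma


def encode(values: List[float], mu: float, sigma: float) -> List[float]:
    """Encoder: rescale to mean 0 and standard deviation 1."""
    return [(v - mu) / sigma for v in values]


def decode(values: List[float], mu: float, sigma: float) -> List[float]:
    """Decoder: map normalized values back to physical units."""
    return [z * sigma + mu for z in values]


speeds = [4.0, 5.0, 6.0, 7.0, 8.0]  # made-up target speeds in m/s
mu, sigma = fit_scaler(speeds)
normalized = encode(speeds, mu, sigma)
restored = decode(normalized, mu, sigma)
```

Reusing the same <code>mu</code> and <code>sigma</code> in the decoder is what makes the round trip return the original units.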

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers predict possible risks based on observations of surrounding vehicles and ensure safety by changing their behavior before the risks occur. To achieve a human-like motion plan, a prediction-based motion planner for autonomous vehicles is designed using the model predictive control (MPC) method, taking into account the predicted future behavior of surrounding drivers. The cost function of the motion planner is defined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q (x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta u}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where $k$ and $t$ are the prediction step index and time index, respectively; $x(k|t)$ and $x_{ref} (k|t)$ are the states and reference of the MPC problem, respectively; $x(k|t)$ is composed of travel distance $p_x$ and longitudinal velocity $v_x$; $x_{ref} (k|t)$ consists of reference travel distance $p_{x,ref}$ and reference longitudinal velocity $v_{x,ref}$; $u(k|t)$ is the control input, which is the longitudinal acceleration command; $N_p$ is the prediction horizon; and $Q$, $R$, and $R_{\Delta u}$ are the weight matrices for states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles.
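
To make the three terms of the cost concrete, the sketch below evaluates $J$ for a short horizon. It assumes a diagonal state weight (one weight per state component) and scalar input weights. The function name and all numbers are made up for illustration and are not the tuned values from the paper.

```python
from typing import List, Sequence, Tuple


def mpc_cost(
    states: Sequence[Tuple[float, float]],  # (p_x, v_x) for k = 1..N_p
    refs: Sequence[Tuple[float, float]],    # (p_x_ref, v_x_ref) for k = 1..N_p
    inputs: List[float],                    # u for k = 0..N_p-1
    q_diag: Tuple[float, float],            # assumed diagonal of Q
    r: float,                               # input weight
    r_delta: float,                         # input-derivative weight
) -> float:
    """Evaluate tracking, effort, and smoothness terms of the cost."""
    tracking = sum(
        q_diag[0] * (x[0] - xr[0]) ** 2 + q_diag[1] * (x[1] - xr[1]) ** 2
        for x, xr in zip(states, refs)
    )
    effort = r * sum(u * u for u in inputs)
    smoothness = r_delta * sum(
        (u_next - u_now) ** 2 for u_now, u_next in zip(inputs, inputs[1:])
    )
    return tracking + effort + smoothness


cost = mpc_cost(
    states=[(1.0, 5.0), (2.0, 5.5), (3.0, 6.0)],
    refs=[(1.0, 5.0), (2.5, 5.5), (3.0, 5.0)],
    inputs=[0.5, 0.5, 1.0],
    q_diag=(1.0, 2.0),
    r=0.1,
    r_delta=0.5,
)
# tracking = 0.25 + 2.0, effort = 0.15, smoothness = 0.125
```

In an actual MPC loop this value would be minimized over the input sequence subject to the constraints. The sketch only evaluates it for a given sequence.
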
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&u_{min} \leq u(k|t) \leq u_{max} \\
&||u(k+1|t) - u(k|t)|| \leq S
\end{split}
\end{equation*}
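
As a small illustration of these two constraints, the check below tests whether a candidate acceleration sequence respects the magnitude bounds and the step-to-step rate limit $S$. The function name and the numeric limits are assumptions chosen for the example, not values from the paper.

```python
from typing import List


def inputs_feasible(
    inputs: List[float], u_min: float, u_max: float, max_step: float
) -> bool:
    """Check magnitude bounds and the rate limit between consecutive inputs."""
    in_range = all(u_min <= u <= u_max for u in inputs)
    rate_ok = all(
        abs(u_next - u_now) <= max_step
        for u_now, u_next in zip(inputs, inputs[1:])
    )
    return in_range and rate_ok


# Made-up limits: accelerations in [-3, 2] m/s^2, at most 0.5 m/s^2 per step.
smooth_plan = inputs_feasible([0.0, 0.5, 1.0], -3.0, 2.0, 0.5)
jerky_plan = inputs_feasible([0.0, 1.0], -3.0, 2.0, 0.5)
```
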
The position and speed bounds are determined from the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = \min(v_{x,ref}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
where $v_{x, limit}$ is the speed limit of the target vehicle.
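
The bounds can be computed step by step over the prediction horizon, as sketched below. The sketch reads $p_{x,tar}$ as the predicted position of the target vehicle and $c_{des}$ as a desired clearance. Both readings, the function name, and the numbers are assumptions, since this summary does not define those symbols.

```python
from typing import Dict, List, Sequence


def state_bounds(
    p_tar: Sequence[float],   # assumed: predicted target position per step
    c_des: Sequence[float],   # assumed: desired clearance per step
    v_ref: Sequence[float],   # velocity term inside the min, per step
    v_limit: float,           # speed limit
) -> List[Dict[str, float]]:
    """Return position and speed bounds for each prediction step."""
    bounds = []
    for p, c, v in zip(p_tar, c_des, v_ref):
        bounds.append(
            {"p_min": 0.0, "p_max": p - c, "v_min": 0.0, "v_max": min(v, v_limit)}
        )
    return bounds


# Made-up two-step horizon: positions in m, speeds in m/s.
limits = state_bounds([30.0, 32.0], [10.0, 10.0], [12.0, 16.0], 13.9)
```

In the second step the velocity term exceeds the limit, so the upper speed bound is clipped to the limit.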

== Prediction performance analysis and application to motion planning ==
=== accuracy analysis ===
The proposed algorithm was compared with three base algorithms: a path-following model with
constant velocity, a path-following model with traffic flow, and a constant turn rate and velocity (CTRV) model.
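
For context, a CTRV model propagates a vehicle along a circular arc at fixed speed and yaw rate. The sketch below is the standard closed-form update, not the implementation used in the paper. The function name and the sample values are hypothetical.

```python
import math
from typing import Tuple


def ctrv_step(
    x: float, y: float, heading: float, speed: float, yaw_rate: float, dt: float
) -> Tuple[float, float, float]:
    """Propagate (x, y, heading) by dt under constant speed and yaw rate."""
    if abs(yaw_rate) < 1e-6:
        # Near-zero yaw rate: fall back to straight-line motion.
        return (
            x + speed * dt * math.cos(heading),
            y + speed * dt * math.sin(heading),
            heading,
        )
    new_heading = heading + yaw_rate * dt
    radius = speed / yaw_rate
    new_x = x + radius * (math.sin(new_heading) - math.sin(heading))
    new_y = y + radius * (math.cos(heading) - math.cos(new_heading))
    return new_x, new_y, new_heading


# Straight driving at 10 m/s for 1.5 s.
print(ctrv_step(0.0, 0.0, 0.0, 10.0, 0.0, 1.5))  # (15.0, 0.0, 0.0)
```

Because the yaw rate is held constant, such a model cannot anticipate a driver who tightens or widens the turn, which is the kind of behavior a learned predictor is meant to capture.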

These algorithms are compared on four types of error: the $x$ position error $e_{x,T_p}$,
the $y$ position error $e_{y,T_p}$, the heading error $e_{\theta,T_p}$, and the velocity error $e_{v,T_p}$, where $T_p$
denotes the prediction time. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,T_p}=&p_{x,T_p} -\hat {p}_{x,T_p}\\
e_{y,T_p}=&p_{y,T_p} -\hat {p}_{y,T_p}\\
e_{\theta,T_p}=&\theta _{T_p} -\hat {\theta }_{T_p}\\
e_{v,T_p}=&v_{T_p} -\hat {v}_{T_p}
\end{split}
\end{equation*}

The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean,
standard deviation (STD), and root mean square error (RMSE). Moreover, the error distribution of the proposed model is a bell-shaped
curve with a mean close to zero, which indicates that the proposed algorithm's predictions of human drivers'
intentions are relatively precise. In addition, $e_{x,T_p}$, $e_{y,T_p}$, and $e_{v,T_p}$ are bounded within
reasonable levels. For instance, the three-sigma range of $e_{y,T_p}$ is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.
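
The three summary statistics can be computed from paired actual and predicted values as sketched below. The data, the 3.5 m lane width, and the reading of the three-sigma range as the interval from mean minus three STD to mean plus three STD are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, pstdev
from typing import Dict, Sequence, Tuple


def error_stats(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Mean, standard deviation, and RMSE of the error (actual - predicted)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return {
        "mean": mean(errors),
        "std": pstdev(errors),
        "rmse": sqrt(mean([e * e for e in errors])),
    }


def three_sigma_interval(stats: Dict[str, float]) -> Tuple[float, float]:
    """Interval mean +/- 3 * std, one possible reading of the three-sigma range."""
    return stats["mean"] - 3 * stats["std"], stats["mean"] + 3 * stats["std"]


# Made-up lateral positions in metres.
y_actual = [1.0, 2.0, 3.0, 4.0]
y_predicted = [1.5, 1.5, 3.5, 3.5]
stats = error_stats(y_actual, y_predicted)
low, high = three_sigma_interval(stats)
lane_width = 3.5  # assumed lane width in metres
fits_in_lane = (high - low) <= lane_width
```

When the mean error is zero, the RMSE equals the standard deviation, so a gap between the two indicates a systematic bias in the predictor.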

=== motion planning application ===
==== case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle even when the target vehicle was not following the intersection guide line.

==== statistical analysis of motion planning application results ====
The data are analysed from two perspectives: the time to recognize the in-lane target and the similarity to
human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base
algorithm. In addition, the proposed algorithm recognized the target later than the base algorithm only when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance and had little influence on determining
the behavior of the subject vehicle.

To compare the similarity between the results from the proposed algorithm and human driving decisions,
another type of error is introduced, the acceleration error $a_{x, error} = a_{x, human} - a_{x, cmd}$, where $a_{x, human}$
and $a_{x, cmd}$ are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm showed results more similar to human drivers’ decisions than the base
algorithms did: $91.97\%$ of the acceleration error lies within $\pm 1 \, m/s^2$. Moreover, the base algorithms
have limited ability to respond to different in-lane target behaviors in traffic flow. Hence, the proposed
model is efficient and safe.
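
A figure such as the share of acceleration errors within a tolerance band can be computed as sketched below. The function name and the sample accelerations are made up, and the result does not reproduce the value reported in the paper.

```python
from typing import Sequence


def fraction_within(
    human: Sequence[float], cmd: Sequence[float], tol: float = 1.0
) -> float:
    """Share of samples whose acceleration error (human - cmd) is within +/- tol."""
    errors = [h - c for h, c in zip(human, cmd)]
    inside = sum(1 for e in errors if abs(e) <= tol)
    return inside / len(errors)


# Made-up accelerations in m/s^2.
a_human = [0.5, -1.0, 2.0, 0.0]
a_cmd = [0.2, 0.5, 0.5, -0.4]
share = fraction_within(a_human, a_cmd)  # errors: 0.3, -1.5, 1.5, 0.4
```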

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained on data captured on urban roads in Seoul and used within an MPC-based motion planner. The evaluation results showed precise predictions, suggesting that the algorithm can be applied safely on an autonomous vehicle. In addition, the comparison with three base algorithms (CV/Path, V_flow/Path, and CTRV) showed the superiority of the proposed algorithm.

== Future works ==
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2. Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3. Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and studies of different road types. Why the LSTM-RNN is used and the background of the method are not stated clearly. The concepts are not well separated, so it is difficult to distinguish between the LSTM-RNN based motion predictor and the motion planner.

This is an interesting topic to discuss and a major one for well-known vehicle companies such as Tesla, which already offers a service called Autopilot that provides self-driving and motion prediction features. This summary could include more diagrams of the model architecture to give readers a complete view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be helpful. It would also be interesting to discuss further applications of this method, such as airplanes and boats.

Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare its performance against different algorithms, or against other traditional motion planning algorithms such as the Kalman filter (KF).

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

Surround Vehicle Motion Prediction

2020-11-28T10:18:31Z

Y87yu: /* motion predictor */

DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu
