statwiki - User contributions [US]

Neural Speed Reading via Skim-RNN

2020-11-30T01:47:52Z

D287zhan: /* Critiques */

== Group ==

Mingyan Dai, Jerry Huang, Daniel Jiang

== Introduction ==

Recurrent Neural Network (RNN) is a class of artificial neural networks where the connection between nodes form a directed graph along with time series and has time dynamic behavior. RNN is derived from a feedforward neural network and can use its memory to process variable-length input sequences. This makes it suitable for tasks such as unsegmented, connected handwriting recognition, and speech recognition.

In Natural Language Processing, Recurrent Neural Network (RNN) is a common architecture used to sequentially ‘read’ input tokens and output a distributed representation for each token. By recurrently updating the hidden state of the neural network, an RNN can inherently require the same computational cost across time. However, when it comes to processing input tokens, some tokens, when being compared to others, are less important to the overall representation of a piece of text or query. For example, in the application of RNN to the question answering problem, it is not uncommon to encounter parts of a passage that are irrelevant to answering the query.

LSTM-Jump (Yu et al., 2017), a variant of LSTMs, was introduced to improve efficiency by skipping multiple tokens at a given step. In contrast, Skim-RNN takes advantage of 'skimming' rather than 'skipping tokens'. This paper demonstrates that skimming achieves higher accuracy compared to skipping tokens, implying that paying attention to unimportant tokens is better than completely ignoring them.

== Model ==

In this paper, the authors introduce a model called 'skim-RNN', which takes advantage of ‘skimming’ less important tokens or pieces of text rather than ‘skipping’ them entirely. This models the human ability to skim through passages, or to spend less time reading parts that do not affect the reader’s main objective. While this leads to a loss in the comprehension rate of the text [1], it greatly reduces the amount of time spent reading by not focusing on areas that will not significantly affect efficiency when it comes to the reader's objective.

'Skim-RNN' works by rapidly determining the significance of each input and spending less time processing unimportant input tokens by using a smaller RNN to update only a fraction of the hidden state. When the decision is to ‘fully read’, that is to not skim the text, Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function (‘skim’ or ‘read’) is non-differentiable, the authors use a gumbel-softmax [2] to estimate the gradient of the function, rather than traditional methods such as REINFORCE (policy gradient)[3]. The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of float operations (Flop reduction, or Flop-R). When the skimming rate is high, which often leads to faster inference on CPUs, which makes it very useful for large-scale products and small devices.

The Skim-RNN has the same input and output interfaces as standard RNNs, so it can be conveniently used to speed up RNNs in existing models. In addition, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting a parameter for the threshold for the ‘skim’ decision.

=== Related Works ===

As the popularity of neural networks has grown, significant attention has been given to make them faster and lighter. In particular, relevant work focused on reducing the computational cost of recurrent neural networks has been carried out by several other related works. For example, LSTM-Jump (You et al., 2017) [8] models aim to speed up run times by skipping certain input tokens, as opposed to skimming them. Choi et al. (2017)[9] proposed a model which uses a CNN-based sentence classifier to determine the most relevant sentence(s) to the question and then uses an RNN-based question-answering model. This model focuses on reducing GPU run-times (as opposed to Skim-RNN which focuses on minimizing CPU-time or Flop), and is also focused only on question answering.

=== Implementation ===

A Skim-RNN consists of two RNN cells, a default (big) RNN cell of hidden state size <math>d</math> and small RNN cell of hidden state size <math>d'</math>, where <math>d</math> and <math>d'</math> are parameters defined by the user and <math>d' \ll d</math>. This follows the fact that there should be a small RNN cell defined for when text is meant to be skimmed and a larger one for when the text should be processed as normal.

Each RNN cell will have its own set of weights and bias as well as be any variant of an RNN. There is no requirement on how the RNN itself is structured, rather the core concept is to allow the model to dynamically make a decision as to which cell to use when processing input tokens. Note that skipping text can be incorporated by setting <math>d'</math> to 0, which means that when the input token is deemed irrelevant to a query or classification task, nothing about the information in the token is retained within the model.

Experimental results suggest that this model is faster than using a single large RNN to process all input tokens, as the smaller RNN requires fewer floating-point operations to process the token. Additionally, higher accuracy and computational efficiency are achieved.

==== Inference ====

At each time step <math>t</math>, the Skim-RNN unit takes in an input <math>{\bf x}_t \in \mathbb{R}^d</math> as well as the previous hidden state <math>{\bf h}_{t-1} \in \mathbb{R}^d</math> and outputs the new state <math>{\bf h}_t </math> (although the dimensions of the hidden state and input are the same, this process holds for different sizes as well). In the Skim-RNN, there is a hard decision that needs to be made whether to read or skim the input, although there could be potential to include options for multiple levels of skimming.

The decision to read or skim is done using a multinomial random variable <math>Q_t</math> over the probability distribution of choices <math>{\bf p}_t</math>, where

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>{\bf p}_t = \text{softmax}(\alpha({\bf x}_t, {\bf h}_{t-1})) = \text{softmax}({\bf W}[{\bf x}_t; {\bf h}_{t-1}]+{\bf b}) \in \mathbb{R}^k</math>
</div>

where <math>{\bf W} \in \mathbb{R}^{k \times 2d}</math>, <math>{\bf b} \in \mathbb{R}^{k}</math> are weights to be learned and <math>[{\bf x}_t; {\bf h}_{t-1}] \in \mathbb{R}^{2d}</math> indicates the row concatenation of the two vectors. In this case, <math> \alpha </math> can have any form as long as the complexity of calculating it is less than <math> O(d^2)</math>. Letting <math>{\bf p}^1_t</math> indicate the probability for fully reading and <math>{\bf p}^2_t</math> indicate the probability for skimming the input at time <math> t</math>, it follows that the decision to read or skim can be modelled using a random variable <math> Q_t</math> by sampling from the distribution <math>{\bf p}_t</math> and
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>Q_t \sim \text{Multinomial}({\bf p}_t)</math>
</div>

Without loss of generality, we can define <math> Q_t = 1</math> to indicate that the input will be read while <math> Q_t = 2</math> indicates that it will be skimmed. Reading requires applying the full RNN on the input as well as the previous hidden state to modify the entire hidden state while skimming only modifies part of the prior hidden state.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>
{\bf h}_t = \begin{cases}
f({\bf x}_t, {\bf h}_{t-1}) & Q_t = 1\\
[f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)] & Q_t = 2
\end{cases}
</math>
</div>

where <math> f </math> is a full RNN with output of dimension <math>d</math> and <math>f'</math> is a smaller RNN with <math>d'</math>-dimensional output. This has advantage that when the model decides to skim, then the computational complexity of that step is only <math>O(d'd)</math>, which is much smaller than <math>O(d^2)</math> due to previously defining <math> d' \ll d</math>.

==== Training ====

Since the expected loss/error of the model is a random variable that depends on the sequence of random variables <math> \{Q_t\} </math>, the loss is minimized with respect to the distribution of the variables. Defining the loss to be minimized while conditioning on a particular sequence of decisions

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>
L(\theta\vert Q)
</math>
</div>
where <math>Q=Q_1\dots Q_T</math> is a sequence of decisions of length <math>T</math>, then the expected loss over the distribution of the sequence of decisions is

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>
\mathbb{E}[L(\theta)] = \sum_{Q} L(\theta\vert Q)P(Q) = \sum_Q L(\theta\vert Q) \Pi_j {\bf p}_j^{Q_j}
</math>
</div>

Since calculating <math>\delta \mathbb{E}_{Q_t}[L(\theta)]</math> directly is rather infeasible, it is possible to approximate the gradients with a gumbel-softmax distribution [2]. Reparameterizing <math> {\bf p}_t</math> as <math> {\bf r}_t</math>, then the back-propagation can flow to <math> {\bf p}_t</math> without being blocked by <math> Q_t</math> and the approximation can arbitrarily approach <math> Q_t</math> by controlling the parameters. The reparameterized distribution is therefore

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>
{\bf r}_t^i = \frac{\text{exp}(\log({\bf p}_t^i + {g_t}^i)/\tau)}{\sum_j\text{exp}(\log({\bf p}_t^j + {g_t}^j)/\tau)}
</math>
</div>

where <math>{g_t}^i</math> is an independent sample from a <math>\text{Gumbel}(0, 1) = -\log(-\log(\text{Uniform}(0, 1))</math> random variable and <math>\tau</math> is a parameter that represents a temperature. Then it can be rewritten that

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>
{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t
</math>
</div>

where <math>{\bf \tilde{h}}_t</math> is the previous equation for <math>{\bf h}_t</math>. The temperature parameter gradually decreases with time, and <math>{\bf r}_t^i</math> becomes more discrete as it approaches 0.

A final addition to the model is to encourage skimming when possible. Therefore an extra term related to the negative log probability of skimming and the sequence length. Therefore the final loss function used for the model is denoted by

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;">
<math>
L'(\theta) =L(\theta) + \gamma \cdot\frac{1}{T} \sum_i -\log({\bf \tilde{p}}^i_t)
</math>
</div>
where <math> \gamma </math> is a parameter used to control the ratio between the main loss function and the negative log probability of skimming.

== Experiment ==

The effectiveness of Skim-RNN was measured in terms of accuracy and float operation reduction on four classification tasks and a question-answering task. These tasks were chosen because they do not require one’s full attention to every detail of the text, but rather ask for capturing the high-level information (classification) or focusing on a specific portion (QA) of the text, which a common context for speed reading. The tasks themselves are listed in the table below.

[[File:Table1SkimRNN.png|center|1000px]]

=== Classification Tasks ===

In a language classification task, the input was a sequence of words and the output was the vector of categorical probabilities. Each word is embedded into a <math>d</math>-dimensional vector. We initialize the vector with GloVe [4] to form representations of the words and use those as the inputs for a long short-term memory (LSTM) architecture. A linear transformation on the last hidden state of the LSTM and then a softmax function was applied to obtain the classification probabilities. Adam [5] was used for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, <math>\tau = \max(0.5, exp(−rn))</math> where <math>r = 1e-4</math> and <math>n</math> is the global training step, following [2]. We experiment on different sizes of big LSTM (<math>d \in \{100, 200\}</math>) and small LSTM (<math>d' \in \{5, 10, 20\}</math>) and the ratio between the model loss and the skim loss (<math>\gamma\in \{0.01, 0.02\}</math>) for Skim-LSTM. The batch sizes used were 32 for SST and Rotten Tomatoes, and 128 for others. For all models, early stopping was used when the validation accuracy did not increase for 3000 global steps.

==== Results ====

[[File:Table2SkimRNN.png|center|1000px]]

[[File:Figure2SkimRNN.png|center|1000px]]

Table 2 shows the accuracy and computational cost of the Skim-RNN model compared with other standard models. It is evident that the Skim-RNN model produces a speed-up on the computational complexity of the task while maintaining a high degree of accuracy. Also, it is interesting to know that the accuracy improvement over LSTM could be due to the increased stability of the hidden state, as the majority of the hidden state is not updated when skimming. Meanwhile, figure 2 demonstrates the effect of varying the size of the small hidden state as well as the parameter <math>\gamma</math> on the accuracy and computational cost.

[[File:Table3SkimRNN.png|center|1000px]]

Table 3 shows an example of a classification task over a IMDb dataset, where Skim-RNN with <math>d = 200</math>, <math>d' = 10</math>, and <math>\gamma = 0.01</math> correctly classifies it with a high skimming rate (92%). The goal was to classify the review as either positive or negative. The black words are skimmed, and the blue words are fully read. The skimmed words are clearly irrelevant and the model learns to only carefully read the important words, such as ‘liked’, ‘dreadful’, and ‘tiresome’.

=== Question Answering Task ===

In the Stanford Question Answering Dataset, the task was to locate the answer span for a given question in a context paragraph. The effectiveness of Skim-RNN for SQuAD was evaluated using two different models: LSTM+Attention and BiDAF [6]. The first model was inspired by most then-present QA systems consisting of multiple LSTM layers and an attention mechanism. This type of model is complex enough to reach reasonable accuracy on the dataset and simple enough to run well-controlled analyses for the Skim-RNN. The second model was an open-source model designed for SQuAD, used primarily to show that Skim-RNN could replace RNN in existing complex systems.

==== Training ====

Adam was used with an initial learning rate of 0.0005. For stable training, the model was pre-trained with a standard LSTM for the first 5k steps, and then fine-tuned with Skim-LSTM.

==== Results ====

[[File:Table4SkimRNN.png|center|1000px]]

Table 4 shows the accuracy (F1 and EM) of LSTM+Attention and Skim-LSTM+Attention models as well as VCRNN [7]. It can be observed from the table that the skimming models achieve higher or similar accuracy scores compared to the non-skimming models while also reducing the computational cost by more than 1.4 times. In addition, decreasing layers (1 layer) or hidden size (<math>d=5</math>) improved the computational cost but significantly decreases the accuracy compared to skimming. The table also shows that replacing LSTM with Skim-LSTM in an existing complex model (BiDAF) stably gives reduced computational cost without losing much accuracy (only 0.2% drop from 77.3% of BiDAF to 77.1% of Sk-BiDAF with <math>\gamma = 0.001</math>).

An explanation for this trend that was given is that the model is more confident about which tokens are important in the second layer. Second, higher <math>\gamma</math> values lead to a higher skimming rate, which agrees with its intended functionality.

Figure 4 shows the F1 score of LSTM+Attention model using standard LSTM and Skim LSTM, sorted in ascending order by Flop-R (computational cost). While models tend to perform better with larger computational cost, Skim LSTM (Red) outperforms standard LSTM (Blue) with a comparable computational cost. It can also be seen that the computational cost of Skim-LSTM is more stable across different configurations and computational cost. Moreover, increasing the value of <math>\gamma</math> for Skim-LSTM gradually increases the skipping rate and Flop-R, while it also led to reduced accuracy.

=== Runtime Benchmark ===

[[File:Figure6SkimRNN.png|center|1000px]]

The details of the runtime benchmarks for LSTM and Skim-LSTM, which are used to estimate the speedup of Skim-LSTM-based models in the experiments, are also discussed. A CPU-based benchmark was assumed to be the default benchmark, which has a direct correlation with the number of float operations that can be performed per second. As mentioned previously, the speed-up results in Table 2 (as well as Figure 7) are benchmarked using Python (NumPy), instead of popular frameworks such as TensorFlow or PyTorch.

Figure 7 shows the relative speed gain of Skim-LSTM compared to standard LSTM with varying hidden state size and skim rate. NumPy was used, with the inferences run on a single thread of CPU. The ratio between the reduction of the number of float operations (Flop-R) of LSTM and Skim-LSTM was plotted, with the ratio acting as a theoretical upper bound of the speed gain on CPUs. From here, it can be noticed that there is a gap between the actual gain and the theoretical gain in speed, with the gap being larger with more overhead of the framework or more parallelization. The gap also decreases as the hidden state size increases because the overhead becomes negligible with very large matrix operations. This indicates that Skim-RNN provides greater benefits for RNNs with larger hidden state size. However, combining Skim-RNN with a CPU-based framework can lead to substantially lower latency than GPUs.

== Results ==

The results clearly indicate that the Skim-RNN model provides features that are suitable for general reading tasks, which include classification and question answering. While the tables indicate that minor losses in accuracy occasionally did result when parameters were set at specific values, they were minor and were acceptable given the improvement in runtime.

An important advantage of Skim-RNN is that the skim rate (and thus computational cost) can be dynamically controlled at inference time by adjusting the threshold for
‘skim’ decision probability <math>{\bf p}^1_t</math>. Figure 5 shows the trade-off between the accuracy and computational cost for two settings, confirming the importance of skimming (<math>d' > 0</math>) compared to skipping (<math>d' = 0</math>).

Figure 6 shows that the model does not skim when the input seems to be relevant to answering the question, which was as expected by the design of the model. In addition, the LSTM in the second layer skims more than that in the first layer mainly because the second layer is more confident about the importance of each token.

== Conclusion ==

A Skim-RNN can offer better latency results on a CPU compared to a standard RNN on a GPU, with lower computational cost, as demonstrated through the results of this study. Compared to RNN, Skim-RNN takes the advantage of "skimming" rather than "reading", spends less time on parts of the input that is unimportant. Future work (as stated by the authors) involves using Skim-RNN for applications that require much higher hidden state size, such as video understanding, and using multiple small RNN cells for varying degrees of skimming. Further, since it has the same input and output interface as a regular RNN, it can replace RNNs in existing applications.

== Critiques ==

1. It seems like Skim-RNN is using the not full RNN of processing words that are not important, thus it can increase speed in some very particular circumstances (ie, only small networks). The extra model complexity did slow down the speed while trying to "optimizing" the efficiency and sacrifice part of accuracy while doing so. It is only trying to target a very specific situation (classification/question-answering) and made comparisons only with the baseline LSTM model. It would be definitely more persuasive if the model can compare with some of the state of art neural network models.

2. This model of Skim-RNN is pretty good to extract binary classification type of text, thus it would be interesting for this to be applied to stock market news analysis. For example, a press release from a company can be analyzed quickly using this model and immediately give the trader a positive or negative summary of the news. Would be beneficial in trading since time and speed is an important factor when executing a trade.

3. An appropriate application for Skim-RNN could be customer service chatbots as they can analyze a customer's message and skim associated company policies to craft a response. In this circumstance, quickly analyzing text is ideal to not waste customers' time.

4. This could be applied to news apps to improve readability by highlighting important sections.

5. This summary describes an interesting and useful model that can save readers time for reading an article. I think it will be interesting that discuss more on training a model by Skim-RNN to highlight the important sections in very long textbooks. As a student, having highlights in the textbook is really helpful to study. But highlight the important parts in a time-consuming work for the author, maybe using Skim-RNN can provide a nice model to do this job.

6. Besides the good training performance of Skim-RNN, it's good to see the algorithm even performs well simply by training with CPU. It would make it possible to perform the result on lite-platforms.

7. Another good application of Skim-RNN could be in reading the terms and conditions of websites. A lot of the terms and conditions documents tend to be painfully long and majority of customers because of the length of the document tend to sign them without reading them. Since these websites can compromise the personal information of the customers by giving it to third parties, it is worthwhile for the customers to know at least what they are signing. Infact, this can also be applied to read important legal official documents like lease agreements etc.

8. Another [https://arxiv.org/abs/1904.00761 paper], written by Christian Hansen et al., also discussed Neural Speed Reading. However, conducting Neural Speed reading via a Skim-RNN, this paper suggested using a Structural-Jump-LSTM. The Structural-Jump-LSTM can both skip and jump test during inference. This model consisted of a standard LSTM with 2 agents: 1 capable of bypassing single words and another capable of bypassing punctuation.

== Applications ==

Recurrent architectures are used in many other applications:

1. '''Real-time video processing''' is an exceedingly demanding and resource-constrained task, particularly in edge settings.

2. '''Music Recommend system''' is which many music platforms such as Spotify and Pandora used for recommending music based on the user's information.

3. '''Speech recognition''' enables the recognition and translation of spoken language into text by computers.

It would be interesting to see if this method could be applied to those cases for more efficient inference, such as on drones or self-driving cars. Another possible application is real-time edge processing of game video for sports arenas.

== References ==

[1] Patricia Anderson Carpenter Marcel Adam Just. The Psychology of Reading and Language Comprehension. 1987.

[2] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.

[3] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

[4] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.

[5] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[6] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017a.

[7] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.

[8] Adams Wei Yu, Hongrae Lee, and Quoc V Le. Learning to skim text. In ACL, 2017.

[9] Eunsol Choi, Daniel Hewlett, Alexandre Lacoste, Illia Polosukhin, Jakob Uszkoreit, and Jonathan Berant. Coarse-to-fine question answering for long documents. In ACL, 2017.

Loss Function Search for Face Recognition

2020-11-30T01:40:40Z

D287zhan: /* Critiques */

== Presented by ==
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang

== Introduction ==
Face recognition is a technology that can label a face to a specific identity. The process involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face and another face map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness.

Hence, the paper introduced a new loss function which can reduce the softmax probability. Softmax probability is the probability for each class. It contains a vector of values that add up to 1 while ranging between 0 and 1. Cross-entropy loss is the negative log of the probabilities. When softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:

<center><math>L_1=-log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k≠y}^K{e^{w^T_yx}}}</math> [1] </center>

Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:

<center><math>L_2=-log\frac{e^{s cos{(\theta_{{w_y},x})}}}{e^{s cos{(\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s cos{(\theta_{{w_y},x})}}}}</math> [1] </center>

This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above. Some of these variations will be discussed in detail in the later sections.

In this paper, the authors first identified that reducing the softmax probability is a key contribution to feature discrimination and designed two design search spaces (random and reward-guided method). They then evaluated their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms using nine popular face recognition benchmarks.

== Previous Work ==
Margin-based (angular, additive, additive angular margins) soft-max loss functions are important in learning discriminative features in face recognition. There have been hand-crafted methods previously developed that require much efforts such as A-softmax, V-softmax, AM-Softmax, and Arc-softmax. Li et al. proposed an AutoML for loss function search method also known as AM-LFS from a hyper-parameter optimization perspective [2]. It automatically determines the search space by leveraging reinforcement learning to the search loss functions during the training process, though the drawback is the complex and unstable search space.

== Motivation ==
Previous algorithms for facial recognition frequently rely on CNNs that may include metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost for these functions was high. This drawback prompts the redesign of classical softmax loss that cannot discriminate features. Multiple softmax loss functions have since been developed, and including margin-based formulations, they often require fine-tuning of parameters and are susceptible to instability. Therefore, researchers need to put in a lot of effort in creating their method in the large design space. AM-LFS takes an optimization approach for selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.

To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one parameter required and an improved search space using a reward-based method allows the authors to determine the best option for their loss function.

== Problem Formulation ==
=== Analysis of Margin-based Softmax Loss ===
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:

<center><math>p_m=\frac{1}{ap+(1-a)}*p</math></center>
<center> where <math>a=1-e^{s{cos{(\theta_{w_y},x)}-f{(m,\theta_{w_y},x)}}}</math> and <math>a≤0</math></center>

<math>a</math> is considered as a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the margin function (<math>f</math>), the minimization of the softmax probability will ensure success.

Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, versus AM-LFS which has 2M parameters without constraints that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because it could be larger than the softmax probability.

=== Random Search ===
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:

<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center>

This encourages the feature margin between different classes and has the capability of feature discrimination. This leads to defining the search space as the choice of <math>h{(a,p)}</math> whose impacts on the training procedure are decided by the modulating factor <math>a</math>. In order to validate the unified formulation, a modulating factor is randomly set at each training epoch. This is noted as Random-Softmax in this paper.

=== Reward-Guided Search ===
Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need to have input/output labelled and it does not need a sub-optimal action to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking an action that maximizes the notion of cumulative reward [3]. The process of RL is shown in figure 1. The equation of the cumulative reward function is:

<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center>

where <math>G_t</math> = cumulative reward, <math>R_t</math> = immediate reward, and <math>R_T</math> = end of episode.

<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate reward which depends on the agent action and the environment reaction to this action.

<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center>
<center>Figure 1: Reinforcement Learning scenario [4]</center>

The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5].

In this paper, RL is being used to generate a distribution of the hyperparameter <math>\mu</math> for the SoftMax equation using the reward function. <math>\mu</math> updates after each epoch from the reward function.

<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center>

=== Optimization ===
Calculating the reward involves a standard bi-level optimization problem, which involves a hyperparameter ({<math>a_1,a_2,…,a_B</math>}) that can be used for minimizing one objective function while maximizing another objective function simultaneously:

<center><math>max_a R(a)=r(M_{w^*(a)},S_v)</math></center>
<center><math>w^*(a)=_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center>

In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The calculated reward for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) yields the corresponding score, then the algorithm chooses the one with the highest score for model index selection. With the model containing the highest score being used in the next epoch, this process is repeated until the training reaches convergence. In the end, the algorithm takes the model with the highest score without retraining.

== Results and Discussion ==
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===
For LFW, there is not a noticeable difference between the algorithms proposed in this paper and the other algorithms. However, AM-Softmax achieved higher results than Search-Softmax. Random-Softmax achieved the highest results by 0.03%.

Random-Softmax outperforms baseline Soft-max and is comparable to most of the margin-based softmax. Search-Softmax boost the performance and better most methods specifically when training CASIA-WebFace-R data set, it achieves 0.72% average improvement over AM-Softmax. The reason the model proposed by the paper gives better results is because of their optimization strategy which helps boost the discimination power. Also the sampled candidate from the paper’s proposed search space can well approximate the margin-based loss functions. More tests need to happen to more complicated protocols to test the performance further. Not a lot of improvement has been shown on those test sets, since they are relatively simple and the performance of all the methods on these test sets are near saturation.

<center>Table 1.Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center>

<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center>

=== Results on RFW ===
The RFW dataset measures racial bias which consists of Caucasian, Indian, Asian, and African. Using this as the test set, Random-softmax and Search-softmax performed better than the other methods. Random-softmax outperforms the baseline softmax by a large margin which means reducing the softmax probability will enhance the feature discrimination for face recognition. It is also observed that the reward guided search-softmax method is more likely to enhance the discriminative feature learning resulting in higher performance as shown in Table 2 and Table 3.

<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center>
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center>

<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center>
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center>

=== Results on MegaFace and Trillion-Pairs ===
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 and the verification (Veri.) with the true positive rate (TPR) at low false acceptance rate (FAR) at <math>1e-3</math> on MegaFace, the identification TPR@FAR = <math>1e-6</math> and the verification TPR@FAR = <math>1e-9</math> on Trillion-Pairs are reported on Table 4 and 5.

On the test sets MegaFace and Trillion-Pairs, Search-softmax achieves the best performance over all other alternative methods. On MegaFace, Search-softmax beat the best competitor AM-softmax by a large margin. It also outperformed AM-LFS due to new designed search space.

<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center>
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center>

<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center>
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center>

From the CMC curves and ROC curves in Figure 2, similar trends are observed at other measures. There is a same trend on Trillion-Pairs where Search-softmax loss is found to be superior with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R at both the identification and verification. Based on these experiments, Search-Softmax loss can perform well, especially with a low false positive rate and it shows a strong generalization ability for face recognition.

<center>[[Image:G25_Figure2_left.png|450px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|450px |alt=Alt text|Title text]]</center>
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center>

== Conclusion ==
In this paper, it is discussed that in order to enhance feature discrimination for face recognition, it is key to know how to reduce the softmax probability. To achieve this goal, unified formulation for the margin-based softmax losses is designed. Two search methods have been developed using a random and a reward-guided loss function and they were validated to be effective over six other methods using nine different test data sets.

== Critiques ==
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.
* Datasets used did require some preprocessing, which may have improved the results beyond what the method otherwise would.
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.
* The test data set they used to test Search-Softmax and Random-Softmax are simple and they saturate in other methods. So the results of their methods didn’t show many advantages since they produce very similar results. A more complicated data set needs to be tested to prove the method's reliability.
* There is another paper Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf] that provides a more detailed explanation about how to reduce margin-based softmax loss.
* It is questionable when it comes to the accuracy of testing sets, as they only used the clean version of CASIA-WebFace and MS-Celeb-1M-vlc for training instead of these two training sets with noisy labels.
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], written by Tae-Hyun Oh et al., they also discuss an optimal loss function for face recognition. However, since in the other paper, they were doing face recognition from voice audio, the loss function used was slightly different than the ones discussed in this paper.

== References ==
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.

[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.
2020].

[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020].

[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].

[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].

Loss Function Search for Face Recognition

2020-11-30T01:34:10Z

D287zhan: /* Critiques */

== Presented by ==
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang

== Introduction ==
Face recognition is a technology that can label a face to a specific identity. The process involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face and another face map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness.

Hence, the paper introduced a new loss function which can reduce the softmax probability. Softmax probability is the probability for each class. It contains a vector of values that add up to 1 while ranging between 0 and 1. Cross-entropy loss is the negative log of the probabilities. When softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:

<center><math>L_1=-log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k≠y}^K{e^{w^T_yx}}}</math> [1] </center>

Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:

<center><math>L_2=-log\frac{e^{s cos{(\theta_{{w_y},x})}}}{e^{s cos{(\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s cos{(\theta_{{w_y},x})}}}}</math> [1] </center>

This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above. Some of these variations will be discussed in detail in the later sections.

In this paper, the authors first identified that reducing the softmax probability is a key contribution to feature discrimination and designed two design search spaces (random and reward-guided method). They then evaluated their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms using nine popular face recognition benchmarks.

== Previous Work ==
Margin-based (angular, additive, additive angular margins) soft-max loss functions are important in learning discriminative features in face recognition. There have been hand-crafted methods previously developed that require much efforts such as A-softmax, V-softmax, AM-Softmax, and Arc-softmax. Li et al. proposed an AutoML for loss function search method also known as AM-LFS from a hyper-parameter optimization perspective [2]. It automatically determines the search space by leveraging reinforcement learning to the search loss functions during the training process, though the drawback is the complex and unstable search space.

== Motivation ==
Previous algorithms for facial recognition frequently rely on CNNs that may include metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost for these functions was high. This drawback prompts the redesign of classical softmax loss that cannot discriminate features. Multiple softmax loss functions have since been developed, and including margin-based formulations, they often require fine-tuning of parameters and are susceptible to instability. Therefore, researchers need to put in a lot of effort in creating their method in the large design space. AM-LFS takes an optimization approach for selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.

To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one parameter required and an improved search space using a reward-based method allows the authors to determine the best option for their loss function.

== Problem Formulation ==
=== Analysis of Margin-based Softmax Loss ===
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:

<center><math>p_m=\frac{1}{ap+(1-a)}*p</math></center>
<center> where <math>a=1-e^{s{cos{(\theta_{w_y},x)}-f{(m,\theta_{w_y},x)}}}</math> and <math>a≤0</math></center>

<math>a</math> is considered as a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the margin function (<math>f</math>), the minimization of the softmax probability will ensure success.

Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, versus AM-LFS which has 2M parameters without constraints that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because it could be larger than the softmax probability.

=== Random Search ===
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:

<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center>

This encourages the feature margin between different classes and has the capability of feature discrimination. This leads to defining the search space as the choice of <math>h{(a,p)}</math> whose impacts on the training procedure are decided by the modulating factor <math>a</math>. In order to validate the unified formulation, a modulating factor is randomly set at each training epoch. This is noted as Random-Softmax in this paper.

=== Reward-Guided Search ===
Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need to have input/output labelled and it does not need a sub-optimal action to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking an action that maximizes the notion of cumulative reward [3]. The process of RL is shown in figure 1. The equation of the cumulative reward function is:

<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center>

where <math>G_t</math> = cumulative reward, <math>R_t</math> = immediate reward, and <math>R_T</math> = end of episode.

<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate reward which depends on the agent action and the environment reaction to this action.

<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center>
<center>Figure 1: Reinforcement Learning scenario [4]</center>

The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5].

In this paper, RL is being used to generate a distribution of the hyperparameter <math>\mu</math> for the SoftMax equation using the reward function. <math>\mu</math> updates after each epoch from the reward function.

<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center>

=== Optimization ===
Calculating the reward involves a standard bi-level optimization problem, which involves a hyperparameter ({<math>a_1,a_2,…,a_B</math>}) that can be used for minimizing one objective function while maximizing another objective function simultaneously:

<center><math>max_a R(a)=r(M_{w^*(a)},S_v)</math></center>
<center><math>w^*(a)=_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center>

In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The calculated reward for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) yields the corresponding score, then the algorithm chooses the one with the highest score for model index selection. With the model containing the highest score being used in the next epoch, this process is repeated until the training reaches convergence. In the end, the algorithm takes the model with the highest score without retraining.

== Results and Discussion ==
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===
For LFW, there is not a noticeable difference between the algorithms proposed in this paper and the other algorithms. However, AM-Softmax achieved higher results than Search-Softmax. Random-Softmax achieved the highest results by 0.03%.

Random-Softmax outperforms baseline Soft-max and is comparable to most of the margin-based softmax. Search-Softmax boost the performance and better most methods specifically when training CASIA-WebFace-R data set, it achieves 0.72% average improvement over AM-Softmax. The reason the model proposed by the paper gives better results is because of their optimization strategy which helps boost the discimination power. Also the sampled candidate from the paper’s proposed search space can well approximate the margin-based loss functions. More tests need to happen to more complicated protocols to test the performance further. Not a lot of improvement has been shown on those test sets, since they are relatively simple and the performance of all the methods on these test sets are near saturation.

<center>Table 1.Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center>

<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center>

=== Results on RFW ===
The RFW dataset measures racial bias which consists of Caucasian, Indian, Asian, and African. Using this as the test set, Random-softmax and Search-softmax performed better than the other methods. Random-softmax outperforms the baseline softmax by a large margin which means reducing the softmax probability will enhance the feature discrimination for face recognition. It is also observed that the reward guided search-softmax method is more likely to enhance the discriminative feature learning resulting in higher performance as shown in Table 2 and Table 3.

<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center>
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center>

<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center>
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center>

=== Results on MegaFace and Trillion-Pairs ===
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 and the verification (Veri.) with the true positive rate (TPR) at low false acceptance rate (FAR) at <math>1e-3</math> on MegaFace, the identification TPR@FAR = <math>1e-6</math> and the verification TPR@FAR = <math>1e-9</math> on Trillion-Pairs are reported on Table 4 and 5.

On the test sets MegaFace and Trillion-Pairs, Search-softmax achieves the best performance over all other alternative methods. On MegaFace, Search-softmax beat the best competitor AM-softmax by a large margin. It also outperformed AM-LFS due to new designed search space.

<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center>
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center>

<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center>
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center>

From the CMC curves and ROC curves in Figure 2, similar trends are observed at other measures. There is a same trend on Trillion-Pairs where Search-softmax loss is found to be superior with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R at both the identification and verification. Based on these experiments, Search-Softmax loss can perform well, especially with a low false positive rate and it shows a strong generalization ability for face recognition.

<center>[[Image:G25_Figure2_left.png|450px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|450px |alt=Alt text|Title text]]</center>
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center>

== Conclusion ==
In this paper, it is discussed that in order to enhance feature discrimination for face recognition, it is key to know how to reduce the softmax probability. To achieve this goal, unified formulation for the margin-based softmax losses is designed. Two search methods have been developed using a random and a reward-guided loss function and they were validated to be effective over six other methods using nine different test data sets.

== Critiques ==
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.
* Datasets used did require some preprocessing, which may have improved the results beyond what the method otherwise would.
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.
* The test data set they used to test Search-Softmax and Random-Softmax are simple and they saturate in other methods. So the results of their methods didn’t show many advantages since they produce very similar results. A more complicated data set needs to be tested to prove the method's reliability.
* There is another paper Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf] that provides a more detailed explanation about how to reduce margin-based softmax loss.
* It is questionable when it comes to the accuracy of testing sets, as they only used the clean version of CASIA-WebFace and MS-Celeb-1M-vlc for training instead of these two training sets with noisy labels.
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper]

== References ==
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.

[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.
2020].

[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020].

[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].

[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].

Mask RCNN

2020-11-30T01:25:50Z

D287zhan: /* Critiques */

== Presented by ==
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang

== Introduction ==
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision.
Mask R-CNN, extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection.

== Visual Perception tasks ==

- Image Classification: Predict a set of labels to characterize the contents of an input image

- Object Detection: Build on image classification but localize each object in an image

- Semantic Segmentation: Associate every pixel in an input image with a class label

- Instance Segmentation: Associate every pixel in an input image to a specific object

[[File:instance segmentation.png | center]]
<div align="center">Figure 1: Visual Perception tasks</div>

Mask RCNN is a deep neural network architecture for Instance Segmentation.

== Related Work ==
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.

ROI Pooling: The main use of ROI Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section.

Faster R-CNN: Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network, proposes candidate object bounding boxes.
The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.

[[File:FasterRCNN.png | center]]
<div align="center">Figure 2: Faster RCNN architecture</div>

ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is actually a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.

[[File:ResNetFPN.png | center]]
<div align="center">Figure 3: ResNetFPN architecture</div>

== Model Architecture ==
The structure of mask R-CNN is quite similar to the structure of faster R-CNN.
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI.

The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification
and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and
regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, a complete R-CNN pipeline stages involve: 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stage 3 and 4 are adjusted to simplify the network procedures.

The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.
One thing worth noticing is that for other network systems, those masks across classes compete with each other, but in this particular case, with a
per-pixel sigmoid and a binary loss the masks across classes no longer compete, which makes this formula the key for good instance segmentation results.

Another important concept involved is called the RoIAlign. This concept is useful in stage 2 where the RoIPool extracts
features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to pixel-to-pixel correspondence. The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. Also, instead of quantization, the coordinates are computed using bilinear interpolation to guarantee spatial correspondence.

The network architecture utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction.

There are some implementation details that should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.

== Results ==
[[File:ExpInstanceSeg.png | center]]
<div align="center">Figure 4: Instance Segmentation Experiments</div>

Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of art model

[[File:BoundingBoxExp.png | center]]
<div align="center">Figure 5: Bounding Box Detection Experiments</div>

Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.

== Ablation Experiments ==
[[File:BackboneExp.png | center]]
<div align="center">Figure 6: Backbone Architecture Experiments</div>

(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.

[[File:MultiVSInde.png | center]]
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div>

(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).

[[File: RoIAlign.png | center]]
<div align="center">Figure 8: RoIAlign Experiments 1</div>

(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers.

[[File: RoIAlignExp.png | center]]
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div>

(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.

[[File:MaskBranchExp.png | center]]
<div align="center">Figure 10: Mask Branch Experiments</div>

(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.

== Conclusion ==
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.

== Critiques ==
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.

It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.

The paper lacks the comparisons of different methods and Mask RNN on unlabelled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.

The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.

== References ==
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.

[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.

[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015

Describtion of Text Mining

2020-11-30T01:17:09Z

D287zhan: /* Classification */

== Presented by ==
Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid

== Introduction ==
This paper focuses on the different text mining techniques and the applications of text mining in the healthcare and biomedical domain. The text mining field has been popular as a result of the amount of text data that is available in different forms. The text data is bound to grow even more in 2020, indicating a 50 times growth since 2010. Text is a kind of unstructured information, which is easy for humans to construct and understand, but it is difficult for machines. Hence, there is a need to design algorithms to effectively process this avalanche of text. To further explore the text mining field, the related text mining approaches can be considered. The different text mining approaches relate to two main methods: knowledge delivery and traditional data mining methods.

The authors note that knowledge delivery methods involve the application of different steps to a specific data set to create specific patterns. Research in knowledge delivery methods has evolved over the years due to advances in hardware and software technology. On the other hand, data mining has experienced substantial development through the intersection of three fields: databases, machine learning, and statistics. As brought out by the authors, text mining approaches focus on the exploration of information from a specific text. The information explored is in the form of structured, semi-structured, and unstructured text. It is important to note that text mining covers different sets of algorithms and topics that include information retrieval. The topics and algorithms are used for analyzing different text forms.

== Text Representation and Encoding ==
In this section of the paper, the authors explore the different ways in which the text can be represented on a large collection of documents. One common way of representing the documents is in the form of a bag of words. The bag of words considers the occurrences of different terms. In different text mining applications, documents are ranked and represented as vectors so as to display the significance of any word. The authors note that the three basic models used are vector space, inference network, and the probabilistic models. The vector space model is used to represent documents by converting them into vectors. In the model, a variable is used to represent each model to indicate the importance of the word in the document. The words are weighted using the TF-IDF scheme computed as

$$
q(w)=f_d(w)*log{\frac{|D|}{f_D(w)}}
$$

In many text mining algorithms, one of the key components is preprocessing. Preprocessing consists of different tasks that include filtering, tokenization, stemming, and lemmatization. The first step is tokenization, where a character sequence is broken down into different words or phrases. After the breakdown, filtering is carried out to remove some words. The various word inflected forms are grouped together through lemmatization, and later, the derived roots of the derived words are obtained through stemming.

== Classification ==
Classification in Text Mining aims to assigned predefined classes to text documents. For a set <math>\mathcal{D} = {d_1, d_2, ... d_n}</math> of documents, such that each <math>d_i</math> is mapped to a label <math>l_i</math> from the set <math>\mathcal{L} = {l_1, l_2, ... l_k}</math>. The goal is to find a classification model <math>f</math> such that: <math>\\</math>
$$
f: \mathcal{D} \rightarrow \mathcal{L} \quad \quad \quad f(\mathcal{d}) = \mathcal{l}
$$
The author illustrates 4 different classifiers that are commonly used in text mining.

'''1. Naive Bayes Classifier'''

Bayes rule is used to classify new examples and select the class that is most has the generated result.
Naive Bayes Classifier models the distribution of documents in each class using a probabilistic model assuming that the distribution
of different terms is independent of each other. The models commonly used in this classifier tried to find the posterior probability of a class based on the distribution and assumes that the documents generated are based on a mixture model parameterized by <math>\theta</math> and compute the likelihood of a document using the sum of probabilities over all mixture component. In addition, the Naive Bayes Classifier can help get around the curse of dimensionality, which may arise with high-dimensional data, such as text.

'''2. Nearest Neighbour Classifier'''

Nearest Neighbour Classifier uses distance-based measures to perform the classification. The documents which belong to the same class are more likely "similar" or close to each other based on the similarity measure. The classification of the test documents is inferred from the class labels of similar documents in the training set.

'''3. Decision Tree Classifier'''

A hierarchical tree of the training instances, in which a condition on the attribute value is used to divide the data hierarchically. The decision tree recursively partitions the training data set into smaller subdivisions based on a set of tests defined at each node or branch. Each node of the tree is a test of some attribute of the training instance, and each branch descending from the node corresponds to one of the values of this attribute. The conditions on the nodes are commonly defined by the terms in the text documents.

'''4. Support Vector Machines'''

SVM is a form of Linear Classifiers which are models that makes a classification decision based on the value of the linear combinations of the documents features. The output of a linear predictor is defined to the <math> y=\vec{a} \cdot \vec{x} + b</math> where <math>\vec{x}</math> is the normalized document word frequency vector, <math>\vec{a}</math> is a vector of coefficient and <math>b</math> is a scalar. Support Vector Machines attempts to find a linear separators between various classes. An advantage of the SVM method is it is robust to high dimensionality.

== Clustering ==
Clustering has been extensively studied in the context of the text as it has a wide range of applications such as visualization and document organization.

Clustering algorithms are used to group similar documents and thus aids in information retrieval. Text clustering can be in different levels of granularities, where clusters can be documents, paragraphs, sentences, or terms. Since text data has numerous distance characteristics that demand the design of text-specific algorithms for the task, using a binary vector to represent the text document is simply not enough. Here are some unique properties of text representation:

1. Text representation has a large dimensionality, in which the size of the vocabulary from which the documents are drawn is massive, but a document might only contain a small number of words.

2. The words in the documents are usually correlated with each other. Need to take the correlation into consideration when designing algorithms.

3. The number of words differs from one another of the document. Thus the document needs to be normalized first before the clustering process.

There are 3 most commonly used text clustering algorithms presented.

'''1. Hierarchical Clustering algorithms'''

Hierarchical Clustering algorithms build d a group of clusters that can be depicted as a hierarchy of clusters. The hierarchy can be constructed in top-down (divisive) or bottom-up (agglomeration). Hierarchical clustering algorithms are one of the Distanced-based clustering algorithms, i.e., using a similarity function to measure the closeness between text documents.

In the top-down approach, the algorithm begins with one cluster which includes all the documents. we recursively split this cluster into sub-clusters.
Here is an example of a Hierarchical Clustering algorithm, the data is to be clustered by the euclidean distance. This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step determines which elements to merge in a cluster by taking the two closest elements, according to the chosen distance.

[[File:418px-Hierarchical clustering simple diagram.svg.png| 300px | center]]

<div align="center">Figure 1: Hierarchical Clustering Raw Data</div>

[[File:250px-Clusters.svg (1).png| 200px | center]]

<div align="center">Figure 2: Hierarchical Clustering Clustered Data</div>

A main advantage of hierarchical clustering is that the algorithm only needs to be done once for any number of clusters (ie. if an individual wishes to use a different number of clusters than originally intended, they do not need to repeat the algorithm)

'''2. k-means Clustering'''

k-means clustering is a partitioning algorithm that partitions n documents in the context of text data into k clusters.

Input: Document D, similarity measure S, number k of cluster
Output: Set of k clusters
Select randomly ''k'' datapoints as starting centroids
While ''not converged'' do
Assign documents to the centroids based on the closest similarity
Calculate the cluster centroids for all clusters
return ''k clusters''

The main disadvantage of k-means clustering is that it is indeed very sensitive to the initial choice of the number of k.

'''3. Probabilistic Clustering and Topic Models'''

Topic modeling is one of the most popular probabilistic clustering algorithms in recent studies. The main idea is to create a *probabilistic generative model* for the corpus of text documents. In topic models, documents are a mixture of topics, where each topic represents a probability distribution over words.

There are two main topic models:
* Probabilistic Latent Semantic Analysis (pLSA)
* Latent Dirichlet Allocation (LDA)

The paper covers LDA in more detail. LDA is a state-of-the-art unsupervised algorithm for extracting topics from a collection of documents.

Given <math>\mathcal{D} = \{d_1, d_2, \cdots, d_{|\mathcal{D}|}\}</math> is the corpus and <math>\mathcal{V} = \{w_1, w_2, \cdots, w_{|\mathcal{V}|}\}</math> is the vocabulary of the corpus.

A topic is <math>z_j, 1 \leq j \leq K</math> is a multinomial probability distribution over <math>|\mathcal{V}|</math> words.

The distribution of word given document is:

<math>p(w_i|d) = \Sigma_{j=1}^K p(w_i|z_j)p(z_j|d)</math>

The LDA assumes the following generative process for the corpus of <math>\mathcal{D}</math>
* For each topic <math>k\in \{1,2,\cdots, K\}</math>, sample a word distribution <math>\phi_k \sim Dir(\beta)</math>
* For each document <math>d \in \{1,2,\cdots,D\}</math>
** Sample a topic distribution <math>\theta_d \sim Dir(\alpha)</math>
** For each word <math>w_n, n \in \{1,2,\cdots,N\}</math> in document <math>d</math>
*** Sample a topic <math>z_i \sim Mult(\theta_d)</math>
*** Sample a word <math>w_n \sim Mult(\phi_{z_i})</math>

In practice, LDA is often used as a module in more complicated models and has already been applied to a wide variety of domains. In addition, many variations of LDA has been created, including supervised LDA (sLDA) and hierarchical LDA (hLDA)

== Information Extraction ==
Information Extraction (IE) is the process of extracting useful, structured information from unstructured or semi-structured text. It automatically extracts based on our command.

For example, consider the following sentence, “XYZ company was founded by Peter in the year of 1950”
We can identify the following information:

Founderof(Peter, XYZ)
Foundedin(1950, XYZ)

The author mentioned 4 parts that are important for Information Extraction

'''1. Namely Entity Recognition(NER)'''

This is the process of identifying real-world entity from free text, such as "Apple Inc.", "Donald Trump", "PlayStation 5" etc. Moreover, the task is to identify the category of these entities, such as "Apple Inc." is in the category of the company, "Donald Trump" is in the category of the USA president, and "PlayStation 5" is in the category of the entertainment system.

'''2. Hidden Markov Model'''

Since traditional probabilistic classification does not consider the predicted labels of neighbor words, we use the Hidden Markov Model when doing Information Extraction. This model is different because it considers the label of one word depends on the previous words that appeared.

'''3. Conditional Random Fields'''

This is a technique that is widely used in Information Extraction. The definition of it is related to graph theory.
let G = (V, E) be a graph and Yv stands for the index of the vertices in G. Then (X, Y) is a conditional random field, when the random variables Yv, conditioned on X, obey Markov property with respect to the graph, and:
p(Yv |X, Yw ,w , v) = p(Yv |X, Yw ,w ∼ v), where w ∼ v means w and v are neighbors in G.

'''4. Relation Extraction'''

This is a task of finding semantic relationships between word entities in text documents. Such as "Seth Curry" is the brother of "Stephen Curry", if there is a document including these two names, the task is to identify the relationship of these two entities.

== Biomedical Application ==

Text mining has several applications in the domain of biomedical sciences. The explosion of academic literature in the field has made it quite hard for scientists to keep up with novel research. This is why text mining techniques are ever so important in making the knowledge digestible.

The text mining techniques are able to extract meaningful information from large data by making use of biomedical ontology, which is a compilation of a common set of terms used in an area of knowledge. The Unified Medical Language System (UMLS) is the most comprehensive such resource, consisting of definitions of biomedical jargon. Several information extraction algorithms rely on the ontology to perform tasks such as Named Entity Recognition (NER) and Relation Extraction.

NER involves locating and classifying biomedical entities into meaningful categories and assigning semantic representation to those entities. The NER methods can be broadly grouped into Dictionary-based, Rule-based and Statistical approaches. Relation extraction, on the other hand, is the process of determining relationships between the entities. This is accomplished mainly by identifying the correlation between entities through analyzing the frequency of terms, as well as rules defined by domain experts. Moreover, modern algorithms are also able to summarize large documents and answer natural language questions posed by humans.

== Conclusion ==

This paper gave a holistic overview of the methods and applications of text mining, particularly its relevance in the biomedical domain. It highlights several popular algorithms and summarizes them along with their advantages, limitations and some potential situations where they could be used. Because of ever-growing data, for example, the very high volume of scientific literature being produced every year, the interest in this field is massive and is bound to grow in the future.

== Critiques==

This is a very detailed approach to introduce some different algorithms on text mining. Since many algorithms are given, it might be a good idea to compare their performances on text mining by training them on some text data and compare them to the former baselines, to see if there exists any improvement.

it is a detailed summary of the techniques used in text mining. It would be more helpful if some dataset can be included for training and testing. The algorithms were grouped by different topics so that different datasets and measurements are required.

It would be better for the paper to include test accuracy for testing and training sets to support text mining is a more efficient and effective algorithm compared to other techniques. Moreover, this paper mentioned Text Mining approach can be used to extract high-quality information from videos. It is to believe that extracting from videos is much more difficult than images and texts. How is it possible to retain its test accuracy at a good level for videos?

== References ==

Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering, and extraction techniques. arXiv preprint arXiv:1707.02919.

User:Cvmustat

2020-11-30T01:10:24Z

D287zhan: /* Critiques */

== Combine Convolution with Recurrent Networks for Text Classification ==
'''Team Members''': Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea

'''Date''': Week of Nov 23

== Introduction ==

Text classification is the task of assigning a set of predefined categories to natural language texts, it involves learning an embedding layer which allows context dependent classification. It is a fundamental task in Natural Language Processing (NLP) with various applications such as sentiment analysis, and topic classification. A classic example involving text classification is given a set of News articles, is it possible to classify the genre or subject of each article? Text classification is useful as text data is a rich source of information, but extracting insights from it directly can be difficult and time-consuming as most text data is unstructured.[1] NLP text classification can help automatically structure and analyze text quickly and cost-effectively, allowing for individuals to extract important features from the text easier than before.

Text classification work mainly focuses on three topics: feature engineering, feature selection, and the use of different types of machine learning algorithms.
:1. Feature engineering, the most widely used feature is the bag of words feature. Some more complex functions are also designed, such as part-of-speech tags, noun phrases, and tree kernels.
:2. Feature selection aims to remove noisy features and improve classification performance. The most common feature selection method is to delete stop words.
:3. Machine learning algorithms usually use classifiers, such as Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM).

In practice, pre-trained word embeddings and deep neural networks are used together for NLP text classification. Word embeddings are used to map the raw text data to an implicit space where the semantic relationships of the words are preserved; words with similar meaning have a similar representation. One can then feed these embeddings into deep neural networks to learn different features of the text. Convolutional neural networks can be used to determine the semantic composition of the text(the meaning), as it treats texts as a 2D matrix by concatenating the embedding of words together. It uses a 1D convolution operator to perform the feature mapping and then conducts a 1D pooling operation over the time domain for obtaining a fixed-length output feature vector, and it can capture both local and position invariant features of the text.[2] Alternatively, Recurrent Neural Networks can be used to determine the contextual meaning of each word in the text (how each word relates to one another) by treating the text as sequential data and then analyzing each word separately. [3] Previous approaches to attempt to combine these two neural networks to incorporate the advantages of both models involve streamlining the two networks, which might decrease their performance. Besides, most methods incorporating a bi-directional Recurrent Neural Network usually concatenate the forward and backward hidden states at each time step, which results in a vector that does not have the interaction information between the forward and backward hidden states.[4] The hidden state in one direction contains only the contextual meaning in that particular direction, however a word's contextual representation, intuitively, is more accurate when collected and viewed from both directions. This paper argues that the failure to observe the meaning of a word in both directions causes the loss of the true meaning of the word, especially for polysemic words (words with more than one meaning) that are context-sensitive.

== Paper Key Contributions ==

This paper suggests an enhanced method of text classification by proposing a new way of combining Convolutional and Recurrent Neural Networks (CRNN) involving the addition of a neural tensor layer. The proposed method maintains each network's respective strengths that are normally lost in previous combination methods. The new suggested architecture is called CRNN, which utilizes both a CNN and RNN that run in parallel on the same input sentence. CNN uses weight matrix learning and produces a 2D matrix that shows the importance of each word based on local and position-invariant features. The bidirectional RNN produces a matrix that learns each word's contextual representation; the words' importance in relation to the rest of the sentence. A neural tensor layer is introduced on top of the RNN to obtain the fusion of bi-directional contextual information surrounding a particular word. This method combines these two matrix representations and classifies the text, providing the important information of each word for prediction, which helps to explain the results. The model also uses dropout and L2 regularization to prevent overfitting.

== CRNN Results vs Benchmarks ==

In order to benchmark the performance of the CRNN model, as well as compare it to other previous efforts, multiple datasets and classification problems were used. All of these datasets are publicly available and can be easily downloaded by any user for testing.

- '''Movie Reviews:''' a sentiment analysis dataset, with two classes (positive and negative).

- '''Yelp:''' a sentiment analysis dataset, with five classes. For this test, a subset of 120,000 reviews was randomly chosen from each class for a total of 600,000 reviews.

- '''AG's News:''' a news categorization dataset, using only the 4 largest classes from the dataset. There are 30000 training samples and 1900 test samples.

- '''20 Newsgroups:''' a news categorization dataset, again using only 4 large classes (comp, politics, rec, and religion) from the dataset.

- '''Sogou News:''' a Chinese news categorization dataset, using 10 major classes as a multi-class classification and include 6500 samples randomly from each class.

- '''Yahoo! Answers:''' a topic classification dataset, with 10 classes and each class contains 140000 training samples and 5000 testing samples.

For the English language datasets, the initial word representations were created using the publicly available ''word2vec'' [https://code.google.com/p/word2vec/] from Google news. For the Chinese language dataset, ''jieba'' [https://github.com/fxsjy/jieba] was used to segment sentences, and then 50-dimensional word vectors were trained on Chinese ''wikipedia'' using ''word2vec''.

A number of other models are run against the same data after preprocessing. Some of these models include:

- '''Self-attentive LSTM:''' an LSTM model with multi-hop attention for sentence embedding.

- '''RCNN:''' the RCNN's recurrent structure allows for increased depth of capture for contextual information. Less noise is introduced on account of the model's holistic structure (compared to local features).

The following results are obtained:

[[File:table of results.png|550px|center]]

The bold results represent the best performing model for a given dataset. These results show that the CRNN model is the best model for 4 of the 6 datasets, with the Self-attentive LSTM beating the CRNN by 0.03 and 0.12 on the news categorization problems. Considering that the CRNN model has better performance than the Self-attentive LSTM on the other 4 datasets, this suggests that the CRNN model is a better performer overall in the conditions of this benchmark.

It should be noted that including the neural tensor layer in the CRNN model leads to a significant performance boost compared to the CRNN models without it. The performance boost can be attributed to the fact that the neural tensor layer captures the surrounding contextual information for each word, and brings this information between the forward and backward RNN in a direct method. As seen in the table, this leads to a better classification accuracy across all datasets.

Another important result was that the CRNN model filter size impacted performance only in the sentiment analysis datasets, as seen in the following:

[[File:filter_effects.png|550px|center]]

== CRNN Model Architecture ==

The CRNN model is a combination of RNN and CNN. It uses CNN to compute the importance of each word in the text and utilizes a neural tensor layer to fuse forward and backward hidden states of bi-directional RNN.

The input of the network is a text, which is a sequence of words. The output of the network is the text representation that is subsequently used as input of a fully-connected layer to obtain the class prediction.

'''RNN Pipeline:'''

The goal of the RNN pipeline is to input each word in a text, and retrieve the contextual information surrounding the word and compute the contextual representation of the word itself. This is accomplished by the use of a bi-directional RNN, such that a Neural Tensor Layer (NTL) can combine the results of the RNN to obtain the final output. RNNs are well-suited to NLP tasks because of their ability to sequentially process data such as ordered text.

A RNN is similar to a feed-forward neural network, but it relies on the use of hidden states. Hidden states are layers in the neural net that produce two outputs: <math> \hat{y}_{t} </math> and <math> h_t </math>. For a time step <math> t </math>, <math> h_t </math> is fed back into the layer to compute <math> \hat{y}_{t+1} </math> and <math> h_{t+1} </math>.

The pipeline will actually use a variant of RNN called GRU, short for Gated Recurrent Units. This is done to address the vanishing gradient problem which causes the network to struggle to memorize words that came earlier in the sequence. Traditional RNNs are only able to remember the most recent words in a sequence, which may be problematic since words that came at the beginning of the sequence that is important to the classification problem may be forgotten. A GRU attempts to solve this by controlling the flow of information through the network using update and reset gates.

Let <math>h_{t-1} \in \mathbb{R}^m, x_t \in \mathbb{R}^d </math> be the inputs, and let <math>\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h \in \mathbb{R}^{m \times d}, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h \in \mathbb{R}^{m \times m}</math> be trainable weight matrices. Then the following equations describe the update and reset gates:

<math>
z_t = \sigma(\mathbf{W}_zx_t + \mathbf{U}_zh_{t-1}) \text{update gate} \\
r_t = \sigma(\mathbf{W}_rx_t + \mathbf{U}_rh_{t-1}) \text{reset gate} \\
\tilde{h}_t = \text{tanh}(\mathbf{W}_hx_t + r_t \circ \mathbf{U}_hh_{t-1}) \text{new memory} \\
h_t = (1-z_t)\circ \tilde{h}_t + z_t\circ h_{t-1}
</math>

Note that <math> \sigma, \text{tanh}, \circ </math> are all element-wise functions. The above equations do the following:

<ol>
<li> <math>h_{t-1}</math> carries information from the previous iteration and <math>x_t</math> is the current input </li>
<li> the update gate <math>z_t</math> controls how much past information should be forwarded to the next hidden state </li>
<li> the rest gate <math>r_t</math> controls how much past information is forgotten or reset </li>
<li> new memory <math>\tilde{h}_t</math> contains the relevant past memory as instructed by <math>r_t</math> and current information from the input <math>x_t</math> </li>
<li> then <math>z_t</math> is used to control what is passed on from <math>h_{t-1}</math> and <math>(1-z_t)</math> controls the new memory that is passed on
</ol>

We treat <math>h_0</math> and <math> h_{n+1} </math> as zero vectors in the method. Thus, each <math>h_t</math> can be computed as above to yield results for the bi-directional RNN. The resulting hidden states <math>\overrightarrow{h_t}</math> and <math>\overleftarrow{h_t}</math> contain contextual information around the <math> t</math>-th word in forward and backward directions respectively. Contrary to convention, instead of concatenating these two vectors, it is argued that the word's contextual representation is more precise when the context information from different directions is collected and fused using a neural tensor layer as it permits greater interactions among each element of hidden states. Using these two vectors as input to the neural tensor layer, <math>V^i </math>, we compute a new representation that aggregates meanings from the forward and backward hidden states more accurately as follows:

<math>
[\hat{h_t}]_i = tanh(\overrightarrow{h_t}V^i\overleftarrow{h_t} + b_i)
</math>

Where <math>V^i \in \mathbb{R}^{m \times m} </math> is the learned tensor layer, and <math> b_i \in \mathbb{R} </math> is the bias.We repeat this <math> m </math> times with different <math>V^i </math> matrices and <math> b_i </math> vectors. Through the neural tensor layer, each element in <math> [\hat{h_t}]_i </math> can be viewed as a different type of intersection between the forward and backward hidden states. In the model, <math> [\hat{h_t}]_i </math> will have the same size as the forward and backward hidden states. We then concatenate the three hidden states vectors to form a new vector that summarizes the context information :
<math>
\overleftrightarrow{h_t} = [\overrightarrow{h_t}^T,\overleftarrow{h_t}^T,\hat{h_t}]^T
</math>

We calculate this vector for every word in the text and then stack them all into matrix <math> H </math> with shape <math>n</math>-by-<math>3m</math>.

<math>
H = [\overleftrightarrow{h_1};...\overleftrightarrow{h_n}]
</math>

This <math>H</math> matrix is then forwarded as the results from the Recurrent Neural Network.

'''CNN Pipeline:'''

The goal of the CNN pipeline is to learn the relative importance of words in an input sequence based on different aspects. The process of this CNN pipeline is summarized as the following steps:

<ol>
<li> Given a sequence of words, each word is converted into a word vector using the word2vec algorithm which gives matrix X.
</li>

<li> Word vectors are then convolved through the temporal dimension with filters of various sizes (ie. different K) with learnable weights to capture various numerical K-gram representations. These K-gram representations are stored in matrix C.
</li>

<ul>
<li> The convolution makes this process capture local and position-invariant features. Local means the K words are contiguous. Position-invariant means K contiguous words at any position are detected in this case via convolution.

<li> Temporal dimension example: convolve words from 1 to K, then convolve words 2 to K+1, etc
</li>
</ul>

<li> Since not all K-gram representations are equally meaningful, there is a learnable matrix W which takes the linear combination of K-gram representations to more heavily weigh the more important K-gram representations for the classification task.
</li>

<li> Each linear combination of the K-gram representations gives the relative word importance based on the aspect that the linear combination encodes.
</li>

<li> The relative word importance vs aspect gives rise to an interpretable attention matrix A, where each element says the relative importance of a specific word for a specific aspect.
</li>

</ol>

[[File:Group12_Figure1.png |center]]

<div align="center">Figure 1: The architecture of CRNN.</div>

== Merging RNN & CNN Pipeline Outputs ==

The results from both the RNN and CNN pipeline can be merged by simply multiplying the output matrices. That is, we compute <math>S=A^TH</math> which has shape <math>z \times 3m</math> and is essentially a linear combination of the hidden states. The concatenated rows of S results in a vector in <math>\mathbb{R}^{3zm}</math> and can be passed to a fully connected Softmax layer to output a vector of probabilities for our classification task.

To train the model, we make the following decisions:
<ul>
<li> Use cross-entropy loss as the loss function (A cross-entropy loss function usually takes in two distributions, a true distribution p and an estimated distribution q, and measures the average number of bits need to identify an event. This calculation is independent of the kind of layers used in the network as well as the kind of activation being implemented.) </li>
<li> Perform dropout on random columns in matrix C in the CNN pipeline </li>
<li> Perform L2 regularization on all parameters </li>
<li> Use stochastic gradient descent with a learning rate of 0.001 </li>
</ul>

== Interpreting Learned CRNN Weights ==

Recall that attention matrix A essentially stores the relative importance of every word in the input sequence for every aspect chosen. Naturally, this means that A is an n-by-z matrix, with n being the number of words in the input sequence and z being the number of aspects considered in the classification task.

Furthermore, for any specific aspect, words with higher attention values are more important relative to other words in the same input sequence. likewise, for any specific word, aspects with higher attention values prioritize the specific word more than other aspects.

For example, in this paper, a sentence is sampled from the Movie Reviews dataset, and the transpose of attention matrix A is visualized. Each word represents an element in matrix A, the intensity of red represents the magnitude of an attention value in A, and each sentence is the relative importance of each word for a specific context. In the first row, the words are weighted in terms of a positive aspect, in the last row, the words are weighted in terms of a negative aspect, and in the middle row, the words are weighted in terms of a positive and negative aspect. Notice how the relative importance of words is a function of the aspect.

[[File:Interpretation example.png|800px|center]]

From the above sample, it is interesting that the word "but" is viewed as a negative aspect. From a linguistic perspective, the semantic of "but" is incredibly difficult to capture because of the degree of contextual information it needs. In this case, "but" is in the middle of a transition from a negative to a positive so the first row should also have given attention to that word. Also, it seems that the model has learned to give very high attention to the two words directly adjacent to the word of high attention: "is" and "and" beside "powerful", and "an" and "cast" beside "unwieldy".

== Conclusion & Summary ==

This paper proposed a new architecture, the Convolutional Recurrent Neural Network, for text classification. The Convolutional Neural Network is used to learn the relative importance of each word from their different aspects and stores it this information into a weight matrix. The Recurrent Neural Network learns each word's contextual representation through the combination of the forward and backward context information that is fused using a neural tensor layer and is stored as a matrix. These two matrices are then combined to get the text representation used for classification. Although the specifics of the performed tests are lacking, the experiment's results indicate that the proposed method performed well in comparison to most previous methods. In addition to performing well, the proposed method also provides insight into which words contribute greatly to the classification decision as the learned matrix from the Convolutional Neural Network stores the relative importance of each word. This information can then be used in other applications or analyses. In the future, one can explore the features extracted from the model and use them to potentially learn new methods such as model space. [5]

== Critiques ==

In the '''Method''' section of the paper, some explanations used the same notation for multiple different elements of the model. This made the paper harder to follow and understand since they were referring to different elements by identical notation. Additionally, the decision to use sigmoid and hyperbolic tangent functions as nonlinearities for representation learning, is not supported with evidence that these are optimal.

In '''Comparison of Methods''', the authors discuss the range of hyperparameter settings that they search through. While some of the hyperparameters have a large range of search values, three parameters are fixed without much explanation as to why for all experiments, size of the hidden state of GRU, number of layers, and dropout. These parameters have a lot to do with the complexity of the model and this paper could be improved by providing relevant reasoning behind these values, or by providing additional experimental results over different values of these parameters.

In the '''Results''' section of the paper, they tried to show that the classification results from the CRNN model can be better interpreted than other models. In these explanations, the details were lacking and the authors did not adequately demonstrate how their model is better than others.

Finally, in the '''Results''' section again, the paper compares the CRNN model to several models which they did not implement and reproduce results with. This can be seen in the chart of results above, where several models do not have entries in the table for all six datasets. Since the authors used a subset of the datasets, these other models which were not reproduced could have different accuracy scores if they had been tested on the same data as the CRNN model. This difference in training and testing data is not mentioned in the paper, and the conclusion that the CRNN model is better in all cases may not be valid.

- Could this be applied to hieroglyphs to decipher/better understand them?

It would be interesting to see how the attention matrix is being constructed and how attention values are being determined in each matrix. For instance, does every different subject have its own attention matrix? If so, how will the situation be handled when the same attention matrix is used in different settings?

-This is an interesting topic. I think it will be better to show more results by using this method. Maybe it will be better to put the result part after the architecture part? Writing a motivation will be better since it will catch readers' "eyes". I think it will be interesting to ask: whether can we apply this to ancient Chinese poetry? Since there are lots of types of ancient Chinese poetry, doing a classification for them will be interesting.

This is an interesting method, I would be curious to see if this can be combined or compared with Quasi-Recurrent Neural Networks (https://arxiv.org/abs/1611.01576). In my experience, QRNNs perform similarly to LSTMs while running significantly faster using convolutions with a special temporal pooling. This seems compatible with the neural tensor layer proposed in this paper, which may be combined to yield stronger performance with faster runtimes.

-The paper shows the CRNN model not performing the best with Ag's news and 20newsgroups. It would be interesting to investigate this in detail and see the difference in the way the data is handled in the model compared to the best performing model(self-attentive LSTM in both datasets).

-From experiments, LSTM outperforms CRNN in some cases. It would be interesting to compare CNN+LSTM and CRNN's performance. Another application for CRNN might be classifying spoken language.

- From the Interpreting Learned CRNN Weights part, the samples are labeled as positive and negative, and their words all have opposite emotional polarities. It can be observed that regardless of whether the polarity of the example is positive or negative, the keyword can be extracted by this method, reflecting that it can capture multiple semantically meaningful components. At the same time it will be very interesting to see if this method is applicable to other specific categories.

- The authors of this paper provide 2 examples of what topic classification is, but do not provide any explicit examples of "polysemic words whose meanings are context-sensitive", one of their main critiques of current methods. This is an opportunity to promote the usefulness of their method and engage and inform the reader, simply by listing examples of these words.

- In another [https://www.aclweb.org/anthology/W99-0908/ paper] written by Andrew McCallum and Kamal Nigam, they introduce a different method of text classification. Namely, instead of a combination of recurrent and convolutional neural networks, they instead utilized bootstrapping with keywords, Expectation-Maximization algorithm, and shrinkage.

== References ==
----

[1] Grimes, Seth. “Unstructured Data and the 80 Percent Rule.” Breakthrough Analysis, 1 Aug. 2008, breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.

[2] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modeling sentences,”
arXiv preprint arXiv:1404.2188, 2014.

[3] K. Cho, B. V. Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning
phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.

[4] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Proceedings
of AAAI, 2015, pp. 2267–2273.

[5] H. Chen, P. Tio, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE
Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 124–136, 2014.

User:Cvmustat

2020-11-30T01:10:00Z

D287zhan: /* Critiques */

== Combine Convolution with Recurrent Networks for Text Classification ==
'''Team Members''': Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea

'''Date''': Week of Nov 23

== Introduction ==

Text classification is the task of assigning a set of predefined categories to natural language texts, it involves learning an embedding layer which allows context dependent classification. It is a fundamental task in Natural Language Processing (NLP) with various applications such as sentiment analysis, and topic classification. A classic example involving text classification is given a set of News articles, is it possible to classify the genre or subject of each article? Text classification is useful as text data is a rich source of information, but extracting insights from it directly can be difficult and time-consuming as most text data is unstructured.[1] NLP text classification can help automatically structure and analyze text quickly and cost-effectively, allowing for individuals to extract important features from the text easier than before.

Text classification work mainly focuses on three topics: feature engineering, feature selection, and the use of different types of machine learning algorithms.
:1. Feature engineering, the most widely used feature is the bag of words feature. Some more complex functions are also designed, such as part-of-speech tags, noun phrases, and tree kernels.
:2. Feature selection aims to remove noisy features and improve classification performance. The most common feature selection method is to delete stop words.
:3. Machine learning algorithms usually use classifiers, such as Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM).

In practice, pre-trained word embeddings and deep neural networks are used together for NLP text classification. Word embeddings are used to map the raw text data to an implicit space where the semantic relationships of the words are preserved; words with similar meaning have a similar representation. One can then feed these embeddings into deep neural networks to learn different features of the text. Convolutional neural networks can be used to determine the semantic composition of the text(the meaning), as it treats texts as a 2D matrix by concatenating the embedding of words together. It uses a 1D convolution operator to perform the feature mapping and then conducts a 1D pooling operation over the time domain for obtaining a fixed-length output feature vector, and it can capture both local and position invariant features of the text.[2] Alternatively, Recurrent Neural Networks can be used to determine the contextual meaning of each word in the text (how each word relates to one another) by treating the text as sequential data and then analyzing each word separately. [3] Previous approaches to attempt to combine these two neural networks to incorporate the advantages of both models involve streamlining the two networks, which might decrease their performance. Besides, most methods incorporating a bi-directional Recurrent Neural Network usually concatenate the forward and backward hidden states at each time step, which results in a vector that does not have the interaction information between the forward and backward hidden states.[4] The hidden state in one direction contains only the contextual meaning in that particular direction, however a word's contextual representation, intuitively, is more accurate when collected and viewed from both directions. This paper argues that the failure to observe the meaning of a word in both directions causes the loss of the true meaning of the word, especially for polysemic words (words with more than one meaning) that are context-sensitive.

== Paper Key Contributions ==

This paper suggests an enhanced method of text classification by proposing a new way of combining Convolutional and Recurrent Neural Networks (CRNN) involving the addition of a neural tensor layer. The proposed method maintains each network's respective strengths that are normally lost in previous combination methods. The new suggested architecture is called CRNN, which utilizes both a CNN and RNN that run in parallel on the same input sentence. CNN uses weight matrix learning and produces a 2D matrix that shows the importance of each word based on local and position-invariant features. The bidirectional RNN produces a matrix that learns each word's contextual representation; the words' importance in relation to the rest of the sentence. A neural tensor layer is introduced on top of the RNN to obtain the fusion of bi-directional contextual information surrounding a particular word. This method combines these two matrix representations and classifies the text, providing the important information of each word for prediction, which helps to explain the results. The model also uses dropout and L2 regularization to prevent overfitting.

== CRNN Results vs Benchmarks ==

In order to benchmark the performance of the CRNN model, as well as compare it to other previous efforts, multiple datasets and classification problems were used. All of these datasets are publicly available and can be easily downloaded by any user for testing.

- '''Movie Reviews:''' a sentiment analysis dataset, with two classes (positive and negative).

- '''Yelp:''' a sentiment analysis dataset, with five classes. For this test, a subset of 120,000 reviews was randomly chosen from each class for a total of 600,000 reviews.

- '''AG's News:''' a news categorization dataset, using only the 4 largest classes from the dataset. There are 30000 training samples and 1900 test samples.

- '''20 Newsgroups:''' a news categorization dataset, again using only 4 large classes (comp, politics, rec, and religion) from the dataset.

- '''Sogou News:''' a Chinese news categorization dataset, using 10 major classes as a multi-class classification and include 6500 samples randomly from each class.

- '''Yahoo! Answers:''' a topic classification dataset, with 10 classes and each class contains 140000 training samples and 5000 testing samples.

For the English language datasets, the initial word representations were created using the publicly available ''word2vec'' [https://code.google.com/p/word2vec/] from Google news. For the Chinese language dataset, ''jieba'' [https://github.com/fxsjy/jieba] was used to segment sentences, and then 50-dimensional word vectors were trained on Chinese ''wikipedia'' using ''word2vec''.

A number of other models are run against the same data after preprocessing. Some of these models include:

- '''Self-attentive LSTM:''' an LSTM model with multi-hop attention for sentence embedding.

- '''RCNN:''' the RCNN's recurrent structure allows for increased depth of capture for contextual information. Less noise is introduced on account of the model's holistic structure (compared to local features).

The following results are obtained:

[[File:table of results.png|550px|center]]

The bold results represent the best performing model for a given dataset. These results show that the CRNN model is the best model for 4 of the 6 datasets, with the Self-attentive LSTM beating the CRNN by 0.03 and 0.12 on the news categorization problems. Considering that the CRNN model has better performance than the Self-attentive LSTM on the other 4 datasets, this suggests that the CRNN model is a better performer overall in the conditions of this benchmark.

It should be noted that including the neural tensor layer in the CRNN model leads to a significant performance boost compared to the CRNN models without it. The performance boost can be attributed to the fact that the neural tensor layer captures the surrounding contextual information for each word, and brings this information between the forward and backward RNN in a direct method. As seen in the table, this leads to a better classification accuracy across all datasets.

Another important result was that the CRNN model filter size impacted performance only in the sentiment analysis datasets, as seen in the following:

[[File:filter_effects.png|550px|center]]

== CRNN Model Architecture ==

The CRNN model is a combination of RNN and CNN. It uses CNN to compute the importance of each word in the text and utilizes a neural tensor layer to fuse forward and backward hidden states of bi-directional RNN.

The input of the network is a text, which is a sequence of words. The output of the network is the text representation that is subsequently used as input of a fully-connected layer to obtain the class prediction.

'''RNN Pipeline:'''

The goal of the RNN pipeline is to input each word in a text, and retrieve the contextual information surrounding the word and compute the contextual representation of the word itself. This is accomplished by the use of a bi-directional RNN, such that a Neural Tensor Layer (NTL) can combine the results of the RNN to obtain the final output. RNNs are well-suited to NLP tasks because of their ability to sequentially process data such as ordered text.

A RNN is similar to a feed-forward neural network, but it relies on the use of hidden states. Hidden states are layers in the neural net that produce two outputs: <math> \hat{y}_{t} </math> and <math> h_t </math>. For a time step <math> t </math>, <math> h_t </math> is fed back into the layer to compute <math> \hat{y}_{t+1} </math> and <math> h_{t+1} </math>.

The pipeline will actually use a variant of RNN called GRU, short for Gated Recurrent Units. This is done to address the vanishing gradient problem which causes the network to struggle to memorize words that came earlier in the sequence. Traditional RNNs are only able to remember the most recent words in a sequence, which may be problematic since words that came at the beginning of the sequence that is important to the classification problem may be forgotten. A GRU attempts to solve this by controlling the flow of information through the network using update and reset gates.

Let <math>h_{t-1} \in \mathbb{R}^m, x_t \in \mathbb{R}^d </math> be the inputs, and let <math>\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h \in \mathbb{R}^{m \times d}, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h \in \mathbb{R}^{m \times m}</math> be trainable weight matrices. Then the following equations describe the update and reset gates:

<math>
z_t = \sigma(\mathbf{W}_zx_t + \mathbf{U}_zh_{t-1}) \text{update gate} \\
r_t = \sigma(\mathbf{W}_rx_t + \mathbf{U}_rh_{t-1}) \text{reset gate} \\
\tilde{h}_t = \text{tanh}(\mathbf{W}_hx_t + r_t \circ \mathbf{U}_hh_{t-1}) \text{new memory} \\
h_t = (1-z_t)\circ \tilde{h}_t + z_t\circ h_{t-1}
</math>

Note that <math> \sigma, \text{tanh}, \circ </math> are all element-wise functions. The above equations do the following:

<ol>
<li> <math>h_{t-1}</math> carries information from the previous iteration and <math>x_t</math> is the current input </li>
<li> the update gate <math>z_t</math> controls how much past information should be forwarded to the next hidden state </li>
<li> the rest gate <math>r_t</math> controls how much past information is forgotten or reset </li>
<li> new memory <math>\tilde{h}_t</math> contains the relevant past memory as instructed by <math>r_t</math> and current information from the input <math>x_t</math> </li>
<li> then <math>z_t</math> is used to control what is passed on from <math>h_{t-1}</math> and <math>(1-z_t)</math> controls the new memory that is passed on
</ol>

We treat <math>h_0</math> and <math> h_{n+1} </math> as zero vectors in the method. Thus, each <math>h_t</math> can be computed as above to yield results for the bi-directional RNN. The resulting hidden states <math>\overrightarrow{h_t}</math> and <math>\overleftarrow{h_t}</math> contain contextual information around the <math> t</math>-th word in forward and backward directions respectively. Contrary to convention, instead of concatenating these two vectors, it is argued that the word's contextual representation is more precise when the context information from different directions is collected and fused using a neural tensor layer as it permits greater interactions among each element of hidden states. Using these two vectors as input to the neural tensor layer, <math>V^i </math>, we compute a new representation that aggregates meanings from the forward and backward hidden states more accurately as follows:

<math>
[\hat{h_t}]_i = tanh(\overrightarrow{h_t}V^i\overleftarrow{h_t} + b_i)
</math>

Where <math>V^i \in \mathbb{R}^{m \times m} </math> is the learned tensor layer, and <math> b_i \in \mathbb{R} </math> is the bias.We repeat this <math> m </math> times with different <math>V^i </math> matrices and <math> b_i </math> vectors. Through the neural tensor layer, each element in <math> [\hat{h_t}]_i </math> can be viewed as a different type of intersection between the forward and backward hidden states. In the model, <math> [\hat{h_t}]_i </math> will have the same size as the forward and backward hidden states. We then concatenate the three hidden states vectors to form a new vector that summarizes the context information :
<math>
\overleftrightarrow{h_t} = [\overrightarrow{h_t}^T,\overleftarrow{h_t}^T,\hat{h_t}]^T
</math>

We calculate this vector for every word in the text and then stack them all into matrix <math> H </math> with shape <math>n</math>-by-<math>3m</math>.

<math>
H = [\overleftrightarrow{h_1};...\overleftrightarrow{h_n}]
</math>

This <math>H</math> matrix is then forwarded as the results from the Recurrent Neural Network.

'''CNN Pipeline:'''

The goal of the CNN pipeline is to learn the relative importance of words in an input sequence based on different aspects. The process of this CNN pipeline is summarized as the following steps:

<ol>
<li> Given a sequence of words, each word is converted into a word vector using the word2vec algorithm which gives matrix X.
</li>

<li> Word vectors are then convolved through the temporal dimension with filters of various sizes (ie. different K) with learnable weights to capture various numerical K-gram representations. These K-gram representations are stored in matrix C.
</li>

<ul>
<li> The convolution makes this process capture local and position-invariant features. Local means the K words are contiguous. Position-invariant means K contiguous words at any position are detected in this case via convolution.

<li> Temporal dimension example: convolve words from 1 to K, then convolve words 2 to K+1, etc
</li>
</ul>

<li> Since not all K-gram representations are equally meaningful, there is a learnable matrix W which takes the linear combination of K-gram representations to more heavily weigh the more important K-gram representations for the classification task.
</li>

<li> Each linear combination of the K-gram representations gives the relative word importance based on the aspect that the linear combination encodes.
</li>

<li> The relative word importance vs aspect gives rise to an interpretable attention matrix A, where each element says the relative importance of a specific word for a specific aspect.
</li>

</ol>

[[File:Group12_Figure1.png |center]]

<div align="center">Figure 1: The architecture of CRNN.</div>

== Merging RNN & CNN Pipeline Outputs ==

The results from both the RNN and CNN pipeline can be merged by simply multiplying the output matrices. That is, we compute <math>S=A^TH</math> which has shape <math>z \times 3m</math> and is essentially a linear combination of the hidden states. The concatenated rows of S results in a vector in <math>\mathbb{R}^{3zm}</math> and can be passed to a fully connected Softmax layer to output a vector of probabilities for our classification task.

To train the model, we make the following decisions:
<ul>
<li> Use cross-entropy loss as the loss function (A cross-entropy loss function usually takes in two distributions, a true distribution p and an estimated distribution q, and measures the average number of bits need to identify an event. This calculation is independent of the kind of layers used in the network as well as the kind of activation being implemented.) </li>
<li> Perform dropout on random columns in matrix C in the CNN pipeline </li>
<li> Perform L2 regularization on all parameters </li>
<li> Use stochastic gradient descent with a learning rate of 0.001 </li>
</ul>

== Interpreting Learned CRNN Weights ==

Recall that attention matrix A essentially stores the relative importance of every word in the input sequence for every aspect chosen. Naturally, this means that A is an n-by-z matrix, with n being the number of words in the input sequence and z being the number of aspects considered in the classification task.

Furthermore, for any specific aspect, words with higher attention values are more important relative to other words in the same input sequence. likewise, for any specific word, aspects with higher attention values prioritize the specific word more than other aspects.

For example, in this paper, a sentence is sampled from the Movie Reviews dataset, and the transpose of attention matrix A is visualized. Each word represents an element in matrix A, the intensity of red represents the magnitude of an attention value in A, and each sentence is the relative importance of each word for a specific context. In the first row, the words are weighted in terms of a positive aspect, in the last row, the words are weighted in terms of a negative aspect, and in the middle row, the words are weighted in terms of a positive and negative aspect. Notice how the relative importance of words is a function of the aspect.

[[File:Interpretation example.png|800px|center]]

From the above sample, it is interesting that the word "but" is viewed as a negative aspect. From a linguistic perspective, the semantic of "but" is incredibly difficult to capture because of the degree of contextual information it needs. In this case, "but" is in the middle of a transition from a negative to a positive so the first row should also have given attention to that word. Also, it seems that the model has learned to give very high attention to the two words directly adjacent to the word of high attention: "is" and "and" beside "powerful", and "an" and "cast" beside "unwieldy".

== Conclusion & Summary ==

This paper proposed a new architecture, the Convolutional Recurrent Neural Network, for text classification. The Convolutional Neural Network is used to learn the relative importance of each word from their different aspects and stores it this information into a weight matrix. The Recurrent Neural Network learns each word's contextual representation through the combination of the forward and backward context information that is fused using a neural tensor layer and is stored as a matrix. These two matrices are then combined to get the text representation used for classification. Although the specifics of the performed tests are lacking, the experiment's results indicate that the proposed method performed well in comparison to most previous methods. In addition to performing well, the proposed method also provides insight into which words contribute greatly to the classification decision as the learned matrix from the Convolutional Neural Network stores the relative importance of each word. This information can then be used in other applications or analyses. In the future, one can explore the features extracted from the model and use them to potentially learn new methods such as model space. [5]

== Critiques ==

In the '''Method''' section of the paper, some explanations used the same notation for multiple different elements of the model. This made the paper harder to follow and understand since they were referring to different elements by identical notation. Additionally, the decision to use sigmoid and hyperbolic tangent functions as nonlinearities for representation learning, is not supported with evidence that these are optimal.

In '''Comparison of Methods''', the authors discuss the range of hyperparameter settings that they search through. While some of the hyperparameters have a large range of search values, three parameters are fixed without much explanation as to why for all experiments, size of the hidden state of GRU, number of layers, and dropout. These parameters have a lot to do with the complexity of the model and this paper could be improved by providing relevant reasoning behind these values, or by providing additional experimental results over different values of these parameters.

In the '''Results''' section of the paper, they tried to show that the classification results from the CRNN model can be better interpreted than other models. In these explanations, the details were lacking and the authors did not adequately demonstrate how their model is better than others.

Finally, in the '''Results''' section again, the paper compares the CRNN model to several models which they did not implement and reproduce results with. This can be seen in the chart of results above, where several models do not have entries in the table for all six datasets. Since the authors used a subset of the datasets, these other models which were not reproduced could have different accuracy scores if they had been tested on the same data as the CRNN model. This difference in training and testing data is not mentioned in the paper, and the conclusion that the CRNN model is better in all cases may not be valid.

- Could this be applied to hieroglyphs to decipher/better understand them?

It would be interesting to see how the attention matrix is being constructed and how attention values are being determined in each matrix. For instance, does every different subject have its own attention matrix? If so, how will the situation be handled when the same attention matrix is used in different settings?

-This is an interesting topic. I think it will be better to show more results by using this method. Maybe it will be better to put the result part after the architecture part? Writing a motivation will be better since it will catch readers' "eyes". I think it will be interesting to ask: whether can we apply this to ancient Chinese poetry? Since there are lots of types of ancient Chinese poetry, doing a classification for them will be interesting.

This is an interesting method, I would be curious to see if this can be combined or compared with Quasi-Recurrent Neural Networks (https://arxiv.org/abs/1611.01576). In my experience, QRNNs perform similarly to LSTMs while running significantly faster using convolutions with a special temporal pooling. This seems compatible with the neural tensor layer proposed in this paper, which may be combined to yield stronger performance with faster runtimes.

-The paper shows the CRNN model not performing the best with Ag's news and 20newsgroups. It would be interesting to investigate this in detail and see the difference in the way the data is handled in the model compared to the best performing model(self-attentive LSTM in both datasets).

-From experiments, LSTM outperforms CRNN in some cases. It would be interesting to compare CNN+LSTM and CRNN's performance. Another application for CRNN might be classifying spoken language.

- From the Interpreting Learned CRNN Weights part, the samples are labeled as positive and negative, and their words all have opposite emotional polarities. It can be observed that regardless of whether the polarity of the example is positive or negative, the keyword can be extracted by this method, reflecting that it can capture multiple semantically meaningful components. At the same time it will be very interesting to see if this method is applicable to other specific categories.

- The authors of this paper provide 2 examples of what topic classification is, but do not provide any explicit examples of "polysemic words whose meanings are context-sensitive", one of their main critiques of current methods. This is an opportunity to promote the usefulness of their method and engage and inform the reader, simply by listing examples of these words.

- In another [aclweb.org/anthology/W99-0908/ paper] written by Andrew McCallum and Kamal Nigam, they introduce a different method of text classification. Namely, instead of a combination of recurrent and convolutional neural networks, they instead utilized bootstrapping with keywords, Expectation-Maximization algorithm, and shrinkage.

== References ==
----

[1] Grimes, Seth. “Unstructured Data and the 80 Percent Rule.” Breakthrough Analysis, 1 Aug. 2008, breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.

[2] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modeling sentences,”
arXiv preprint arXiv:1404.2188, 2014.

[3] K. Cho, B. V. Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning
phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.

[4] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Proceedings
of AAAI, 2015, pp. 2267–2273.

[5] H. Chen, P. Tio, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE
Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 124–136, 2014.

User:Cvmustat

2020-11-30T01:09:10Z

D287zhan: /* Critiques */

== Combine Convolution with Recurrent Networks for Text Classification ==
'''Team Members''': Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea

'''Date''': Week of Nov 23

== Introduction ==

Text classification is the task of assigning a set of predefined categories to natural language texts, it involves learning an embedding layer which allows context dependent classification. It is a fundamental task in Natural Language Processing (NLP) with various applications such as sentiment analysis, and topic classification. A classic example involving text classification is given a set of News articles, is it possible to classify the genre or subject of each article? Text classification is useful as text data is a rich source of information, but extracting insights from it directly can be difficult and time-consuming as most text data is unstructured.[1] NLP text classification can help automatically structure and analyze text quickly and cost-effectively, allowing for individuals to extract important features from the text easier than before.

Text classification work mainly focuses on three topics: feature engineering, feature selection, and the use of different types of machine learning algorithms.
:1. Feature engineering, the most widely used feature is the bag of words feature. Some more complex functions are also designed, such as part-of-speech tags, noun phrases, and tree kernels.
:2. Feature selection aims to remove noisy features and improve classification performance. The most common feature selection method is to delete stop words.
:3. Machine learning algorithms usually use classifiers, such as Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM).

In practice, pre-trained word embeddings and deep neural networks are used together for NLP text classification. Word embeddings are used to map the raw text data to an implicit space where the semantic relationships of the words are preserved; words with similar meaning have a similar representation. One can then feed these embeddings into deep neural networks to learn different features of the text. Convolutional neural networks can be used to determine the semantic composition of the text(the meaning), as it treats texts as a 2D matrix by concatenating the embedding of words together. It uses a 1D convolution operator to perform the feature mapping and then conducts a 1D pooling operation over the time domain for obtaining a fixed-length output feature vector, and it can capture both local and position invariant features of the text.[2] Alternatively, Recurrent Neural Networks can be used to determine the contextual meaning of each word in the text (how each word relates to one another) by treating the text as sequential data and then analyzing each word separately. [3] Previous approaches to attempt to combine these two neural networks to incorporate the advantages of both models involve streamlining the two networks, which might decrease their performance. Besides, most methods incorporating a bi-directional Recurrent Neural Network usually concatenate the forward and backward hidden states at each time step, which results in a vector that does not have the interaction information between the forward and backward hidden states.[4] The hidden state in one direction contains only the contextual meaning in that particular direction, however a word's contextual representation, intuitively, is more accurate when collected and viewed from both directions. This paper argues that the failure to observe the meaning of a word in both directions causes the loss of the true meaning of the word, especially for polysemic words (words with more than one meaning) that are context-sensitive.

== Paper Key Contributions ==

This paper suggests an enhanced method of text classification by proposing a new way of combining Convolutional and Recurrent Neural Networks (CRNN) involving the addition of a neural tensor layer. The proposed method maintains each network's respective strengths that are normally lost in previous combination methods. The new suggested architecture is called CRNN, which utilizes both a CNN and RNN that run in parallel on the same input sentence. CNN uses weight matrix learning and produces a 2D matrix that shows the importance of each word based on local and position-invariant features. The bidirectional RNN produces a matrix that learns each word's contextual representation; the words' importance in relation to the rest of the sentence. A neural tensor layer is introduced on top of the RNN to obtain the fusion of bi-directional contextual information surrounding a particular word. This method combines these two matrix representations and classifies the text, providing the important information of each word for prediction, which helps to explain the results. The model also uses dropout and L2 regularization to prevent overfitting.

== CRNN Results vs Benchmarks ==

In order to benchmark the performance of the CRNN model, as well as compare it to other previous efforts, multiple datasets and classification problems were used. All of these datasets are publicly available and can be easily downloaded by any user for testing.

- '''Movie Reviews:''' a sentiment analysis dataset, with two classes (positive and negative).

- '''Yelp:''' a sentiment analysis dataset, with five classes. For this test, a subset of 120,000 reviews was randomly chosen from each class for a total of 600,000 reviews.

- '''AG's News:''' a news categorization dataset, using only the 4 largest classes from the dataset. There are 30000 training samples and 1900 test samples.

- '''20 Newsgroups:''' a news categorization dataset, again using only 4 large classes (comp, politics, rec, and religion) from the dataset.

- '''Sogou News:''' a Chinese news categorization dataset, using 10 major classes as a multi-class classification and include 6500 samples randomly from each class.

- '''Yahoo! Answers:''' a topic classification dataset, with 10 classes and each class contains 140000 training samples and 5000 testing samples.

For the English language datasets, the initial word representations were created using the publicly available ''word2vec'' [https://code.google.com/p/word2vec/] from Google news. For the Chinese language dataset, ''jieba'' [https://github.com/fxsjy/jieba] was used to segment sentences, and then 50-dimensional word vectors were trained on Chinese ''wikipedia'' using ''word2vec''.

A number of other models are run against the same data after preprocessing. Some of these models include:

- '''Self-attentive LSTM:''' an LSTM model with multi-hop attention for sentence embedding.

- '''RCNN:''' the RCNN's recurrent structure allows for increased depth of capture for contextual information. Less noise is introduced on account of the model's holistic structure (compared to local features).

The following results are obtained:

[[File:table of results.png|550px|center]]

The bold results represent the best performing model for a given dataset. These results show that the CRNN model is the best model for 4 of the 6 datasets, with the Self-attentive LSTM beating the CRNN by 0.03 and 0.12 on the news categorization problems. Considering that the CRNN model has better performance than the Self-attentive LSTM on the other 4 datasets, this suggests that the CRNN model is a better performer overall in the conditions of this benchmark.

It should be noted that including the neural tensor layer in the CRNN model leads to a significant performance boost compared to the CRNN models without it. The performance boost can be attributed to the fact that the neural tensor layer captures the surrounding contextual information for each word, and brings this information between the forward and backward RNN in a direct method. As seen in the table, this leads to a better classification accuracy across all datasets.

Another important result was that the CRNN model filter size impacted performance only in the sentiment analysis datasets, as seen in the following:

[[File:filter_effects.png|550px|center]]

== CRNN Model Architecture ==

The CRNN model is a combination of RNN and CNN. It uses CNN to compute the importance of each word in the text and utilizes a neural tensor layer to fuse forward and backward hidden states of bi-directional RNN.

The input of the network is a text, which is a sequence of words. The output of the network is the text representation that is subsequently used as input of a fully-connected layer to obtain the class prediction.

'''RNN Pipeline:'''

The goal of the RNN pipeline is to input each word in a text, and retrieve the contextual information surrounding the word and compute the contextual representation of the word itself. This is accomplished by the use of a bi-directional RNN, such that a Neural Tensor Layer (NTL) can combine the results of the RNN to obtain the final output. RNNs are well-suited to NLP tasks because of their ability to sequentially process data such as ordered text.

A RNN is similar to a feed-forward neural network, but it relies on the use of hidden states. Hidden states are layers in the neural net that produce two outputs: <math> \hat{y}_{t} </math> and <math> h_t </math>. For a time step <math> t </math>, <math> h_t </math> is fed back into the layer to compute <math> \hat{y}_{t+1} </math> and <math> h_{t+1} </math>.

The pipeline will actually use a variant of RNN called GRU, short for Gated Recurrent Units. This is done to address the vanishing gradient problem which causes the network to struggle to memorize words that came earlier in the sequence. Traditional RNNs are only able to remember the most recent words in a sequence, which may be problematic since words that came at the beginning of the sequence that is important to the classification problem may be forgotten. A GRU attempts to solve this by controlling the flow of information through the network using update and reset gates.

Let <math>h_{t-1} \in \mathbb{R}^m, x_t \in \mathbb{R}^d </math> be the inputs, and let <math>\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h \in \mathbb{R}^{m \times d}, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h \in \mathbb{R}^{m \times m}</math> be trainable weight matrices. Then the following equations describe the update and reset gates:

<math>
z_t = \sigma(\mathbf{W}_zx_t + \mathbf{U}_zh_{t-1}) \text{update gate} \\
r_t = \sigma(\mathbf{W}_rx_t + \mathbf{U}_rh_{t-1}) \text{reset gate} \\
\tilde{h}_t = \text{tanh}(\mathbf{W}_hx_t + r_t \circ \mathbf{U}_hh_{t-1}) \text{new memory} \\
h_t = (1-z_t)\circ \tilde{h}_t + z_t\circ h_{t-1}
</math>

Note that <math> \sigma, \text{tanh}, \circ </math> are all element-wise functions. The above equations do the following:

<ol>
<li> <math>h_{t-1}</math> carries information from the previous iteration and <math>x_t</math> is the current input </li>
<li> the update gate <math>z_t</math> controls how much past information should be forwarded to the next hidden state </li>
<li> the rest gate <math>r_t</math> controls how much past information is forgotten or reset </li>
<li> new memory <math>\tilde{h}_t</math> contains the relevant past memory as instructed by <math>r_t</math> and current information from the input <math>x_t</math> </li>
<li> then <math>z_t</math> is used to control what is passed on from <math>h_{t-1}</math> and <math>(1-z_t)</math> controls the new memory that is passed on
</ol>

We treat <math>h_0</math> and <math> h_{n+1} </math> as zero vectors in the method. Thus, each <math>h_t</math> can be computed as above to yield results for the bi-directional RNN. The resulting hidden states <math>\overrightarrow{h_t}</math> and <math>\overleftarrow{h_t}</math> contain contextual information around the <math> t</math>-th word in forward and backward directions respectively. Contrary to convention, instead of concatenating these two vectors, it is argued that the word's contextual representation is more precise when the context information from different directions is collected and fused using a neural tensor layer as it permits greater interactions among each element of hidden states. Using these two vectors as input to the neural tensor layer, <math>V^i </math>, we compute a new representation that aggregates meanings from the forward and backward hidden states more accurately as follows:

<math>
[\hat{h_t}]_i = tanh(\overrightarrow{h_t}V^i\overleftarrow{h_t} + b_i)
</math>

Where <math>V^i \in \mathbb{R}^{m \times m} </math> is the learned tensor layer, and <math> b_i \in \mathbb{R} </math> is the bias.We repeat this <math> m </math> times with different <math>V^i </math> matrices and <math> b_i </math> vectors. Through the neural tensor layer, each element in <math> [\hat{h_t}]_i </math> can be viewed as a different type of intersection between the forward and backward hidden states. In the model, <math> [\hat{h_t}]_i </math> will have the same size as the forward and backward hidden states. We then concatenate the three hidden states vectors to form a new vector that summarizes the context information :
<math>
\overleftrightarrow{h_t} = [\overrightarrow{h_t}^T,\overleftarrow{h_t}^T,\hat{h_t}]^T
</math>

We calculate this vector for every word in the text and then stack them all into matrix <math> H </math> with shape <math>n</math>-by-<math>3m</math>.

<math>
H = [\overleftrightarrow{h_1};...\overleftrightarrow{h_n}]
</math>

This <math>H</math> matrix is then forwarded as the results from the Recurrent Neural Network.

'''CNN Pipeline:'''

The goal of the CNN pipeline is to learn the relative importance of words in an input sequence based on different aspects. The process of this CNN pipeline is summarized as the following steps:

<ol>
<li> Given a sequence of words, each word is converted into a word vector using the word2vec algorithm which gives matrix X.
</li>

<li> Word vectors are then convolved through the temporal dimension with filters of various sizes (ie. different K) with learnable weights to capture various numerical K-gram representations. These K-gram representations are stored in matrix C.
</li>

<ul>
<li> The convolution makes this process capture local and position-invariant features. Local means the K words are contiguous. Position-invariant means K contiguous words at any position are detected in this case via convolution.

<li> Temporal dimension example: convolve words from 1 to K, then convolve words 2 to K+1, etc
</li>
</ul>

<li> Since not all K-gram representations are equally meaningful, there is a learnable matrix W which takes the linear combination of K-gram representations to more heavily weigh the more important K-gram representations for the classification task.
</li>

<li> Each linear combination of the K-gram representations gives the relative word importance based on the aspect that the linear combination encodes.
</li>

<li> The relative word importance vs aspect gives rise to an interpretable attention matrix A, where each element says the relative importance of a specific word for a specific aspect.
</li>

</ol>

[[File:Group12_Figure1.png |center]]

<div align="center">Figure 1: The architecture of CRNN.</div>

== Merging RNN & CNN Pipeline Outputs ==

The results from both the RNN and CNN pipeline can be merged by simply multiplying the output matrices. That is, we compute <math>S=A^TH</math> which has shape <math>z \times 3m</math> and is essentially a linear combination of the hidden states. The concatenated rows of S results in a vector in <math>\mathbb{R}^{3zm}</math> and can be passed to a fully connected Softmax layer to output a vector of probabilities for our classification task.

To train the model, we make the following decisions:
<ul>
<li> Use cross-entropy loss as the loss function (A cross-entropy loss function usually takes in two distributions, a true distribution p and an estimated distribution q, and measures the average number of bits need to identify an event. This calculation is independent of the kind of layers used in the network as well as the kind of activation being implemented.) </li>
<li> Perform dropout on random columns in matrix C in the CNN pipeline </li>
<li> Perform L2 regularization on all parameters </li>
<li> Use stochastic gradient descent with a learning rate of 0.001 </li>
</ul>

== Interpreting Learned CRNN Weights ==

Recall that attention matrix A essentially stores the relative importance of every word in the input sequence for every aspect chosen. Naturally, this means that A is an n-by-z matrix, with n being the number of words in the input sequence and z being the number of aspects considered in the classification task.

Furthermore, for any specific aspect, words with higher attention values are more important relative to other words in the same input sequence. likewise, for any specific word, aspects with higher attention values prioritize the specific word more than other aspects.

For example, in this paper, a sentence is sampled from the Movie Reviews dataset, and the transpose of attention matrix A is visualized. Each word represents an element in matrix A, the intensity of red represents the magnitude of an attention value in A, and each sentence is the relative importance of each word for a specific context. In the first row, the words are weighted in terms of a positive aspect, in the last row, the words are weighted in terms of a negative aspect, and in the middle row, the words are weighted in terms of a positive and negative aspect. Notice how the relative importance of words is a function of the aspect.

[[File:Interpretation example.png|800px|center]]

From the above sample, it is interesting that the word "but" is viewed as a negative aspect. From a linguistic perspective, the semantic of "but" is incredibly difficult to capture because of the degree of contextual information it needs. In this case, "but" is in the middle of a transition from a negative to a positive so the first row should also have given attention to that word. Also, it seems that the model has learned to give very high attention to the two words directly adjacent to the word of high attention: "is" and "and" beside "powerful", and "an" and "cast" beside "unwieldy".

== Conclusion & Summary ==

This paper proposed a new architecture, the Convolutional Recurrent Neural Network, for text classification. The Convolutional Neural Network is used to learn the relative importance of each word from their different aspects and stores it this information into a weight matrix. The Recurrent Neural Network learns each word's contextual representation through the combination of the forward and backward context information that is fused using a neural tensor layer and is stored as a matrix. These two matrices are then combined to get the text representation used for classification. Although the specifics of the performed tests are lacking, the experiment's results indicate that the proposed method performed well in comparison to most previous methods. In addition to performing well, the proposed method also provides insight into which words contribute greatly to the classification decision as the learned matrix from the Convolutional Neural Network stores the relative importance of each word. This information can then be used in other applications or analyses. In the future, one can explore the features extracted from the model and use them to potentially learn new methods such as model space. [5]

== Critiques ==

In the '''Method''' section of the paper, some explanations used the same notation for multiple different elements of the model. This made the paper harder to follow and understand since they were referring to different elements by identical notation. Additionally, the decision to use sigmoid and hyperbolic tangent functions as nonlinearities for representation learning, is not supported with evidence that these are optimal.

In '''Comparison of Methods''', the authors discuss the range of hyperparameter settings that they search through. While some of the hyperparameters have a large range of search values, three parameters are fixed without much explanation as to why for all experiments, size of the hidden state of GRU, number of layers, and dropout. These parameters have a lot to do with the complexity of the model and this paper could be improved by providing relevant reasoning behind these values, or by providing additional experimental results over different values of these parameters.

In the '''Results''' section of the paper, they tried to show that the classification results from the CRNN model can be better interpreted than other models. In these explanations, the details were lacking and the authors did not adequately demonstrate how their model is better than others.

Finally, in the '''Results''' section again, the paper compares the CRNN model to several models which they did not implement and reproduce results with. This can be seen in the chart of results above, where several models do not have entries in the table for all six datasets. Since the authors used a subset of the datasets, these other models which were not reproduced could have different accuracy scores if they had been tested on the same data as the CRNN model. This difference in training and testing data is not mentioned in the paper, and the conclusion that the CRNN model is better in all cases may not be valid.

- Could this be applied to hieroglyphs to decipher/better understand them?

It would be interesting to see how the attention matrix is being constructed and how attention values are being determined in each matrix. For instance, does every different subject have its own attention matrix? If so, how will the situation be handled when the same attention matrix is used in different settings?

-This is an interesting topic. I think it will be better to show more results by using this method. Maybe it will be better to put the result part after the architecture part? Writing a motivation will be better since it will catch readers' "eyes". I think it will be interesting to ask: whether can we apply this to ancient Chinese poetry? Since there are lots of types of ancient Chinese poetry, doing a classification for them will be interesting.

This is an interesting method, I would be curious to see if this can be combined or compared with Quasi-Recurrent Neural Networks (https://arxiv.org/abs/1611.01576). In my experience, QRNNs perform similarly to LSTMs while running significantly faster using convolutions with a special temporal pooling. This seems compatible with the neural tensor layer proposed in this paper, which may be combined to yield stronger performance with faster runtimes.

-The paper shows the CRNN model not performing the best with Ag's news and 20newsgroups. It would be interesting to investigate this in detail and see the difference in the way the data is handled in the model compared to the best performing model(self-attentive LSTM in both datasets).

-From experiments, LSTM outperforms CRNN in some cases. It would be interesting to compare CNN+LSTM and CRNN's performance. Another application for CRNN might be classifying spoken language.

- From the Interpreting Learned CRNN Weights part, the samples are labeled as positive and negative, and their words all have opposite emotional polarities. It can be observed that regardless of whether the polarity of the example is positive or negative, the keyword can be extracted by this method, reflecting that it can capture multiple semantically meaningful components. At the same time it will be very interesting to see if this method is applicable to other specific categories.

- The authors of this paper provide 2 examples of what topic classification is, but do not provide any explicit examples of "polysemic words whose meanings are context-sensitive", one of their main critiques of current methods. This is an opportunity to promote the usefulness of their method and engage and inform the reader, simply by listing examples of these words.

- In another [aclweb.org/anthology/W99-0908.pdf paper] written by Andrew McCallum and Kamal Nigam, they introduce a different method of text classification. Namely, instead of a combination of recurrent and convolutional neural networks, they instead utilized bootstrapping with keywords, Expectation-Maximization algorithm, and shrinkage.

== References ==
----

[1] Grimes, Seth. “Unstructured Data and the 80 Percent Rule.” Breakthrough Analysis, 1 Aug. 2008, breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.

[2] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modeling sentences,”
arXiv preprint arXiv:1404.2188, 2014.

[3] K. Cho, B. V. Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning
phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.

[4] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Proceedings
of AAAI, 2015, pp. 2267–2273.

[5] H. Chen, P. Tio, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE
Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 124–136, 2014.

User:Cvmustat

2020-11-30T01:08:08Z

D287zhan: /* Critiques */

== Combine Convolution with Recurrent Networks for Text Classification ==
'''Team Members''': Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea

'''Date''': Week of Nov 23

== Introduction ==

Text classification is the task of assigning a set of predefined categories to natural language texts, it involves learning an embedding layer which allows context dependent classification. It is a fundamental task in Natural Language Processing (NLP) with various applications such as sentiment analysis, and topic classification. A classic example involving text classification is given a set of News articles, is it possible to classify the genre or subject of each article? Text classification is useful as text data is a rich source of information, but extracting insights from it directly can be difficult and time-consuming as most text data is unstructured.[1] NLP text classification can help automatically structure and analyze text quickly and cost-effectively, allowing for individuals to extract important features from the text easier than before.

Text classification work mainly focuses on three topics: feature engineering, feature selection, and the use of different types of machine learning algorithms.
:1. Feature engineering, the most widely used feature is the bag of words feature. Some more complex functions are also designed, such as part-of-speech tags, noun phrases, and tree kernels.
:2. Feature selection aims to remove noisy features and improve classification performance. The most common feature selection method is to delete stop words.
:3. Machine learning algorithms usually use classifiers, such as Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM).

In practice, pre-trained word embeddings and deep neural networks are used together for NLP text classification. Word embeddings are used to map the raw text data to an implicit space where the semantic relationships of the words are preserved; words with similar meaning have a similar representation. One can then feed these embeddings into deep neural networks to learn different features of the text. Convolutional neural networks can be used to determine the semantic composition of the text(the meaning), as it treats texts as a 2D matrix by concatenating the embedding of words together. It uses a 1D convolution operator to perform the feature mapping and then conducts a 1D pooling operation over the time domain for obtaining a fixed-length output feature vector, and it can capture both local and position invariant features of the text.[2] Alternatively, Recurrent Neural Networks can be used to determine the contextual meaning of each word in the text (how each word relates to one another) by treating the text as sequential data and then analyzing each word separately. [3] Previous approaches to attempt to combine these two neural networks to incorporate the advantages of both models involve streamlining the two networks, which might decrease their performance. Besides, most methods incorporating a bi-directional Recurrent Neural Network usually concatenate the forward and backward hidden states at each time step, which results in a vector that does not have the interaction information between the forward and backward hidden states.[4] The hidden state in one direction contains only the contextual meaning in that particular direction, however a word's contextual representation, intuitively, is more accurate when collected and viewed from both directions. This paper argues that the failure to observe the meaning of a word in both directions causes the loss of the true meaning of the word, especially for polysemic words (words with more than one meaning) that are context-sensitive.

== Paper Key Contributions ==

This paper suggests an enhanced method of text classification by proposing a new way of combining Convolutional and Recurrent Neural Networks (CRNN) involving the addition of a neural tensor layer. The proposed method maintains each network's respective strengths that are normally lost in previous combination methods. The new suggested architecture is called CRNN, which utilizes both a CNN and RNN that run in parallel on the same input sentence. CNN uses weight matrix learning and produces a 2D matrix that shows the importance of each word based on local and position-invariant features. The bidirectional RNN produces a matrix that learns each word's contextual representation; the words' importance in relation to the rest of the sentence. A neural tensor layer is introduced on top of the RNN to obtain the fusion of bi-directional contextual information surrounding a particular word. This method combines these two matrix representations and classifies the text, providing the important information of each word for prediction, which helps to explain the results. The model also uses dropout and L2 regularization to prevent overfitting.

== CRNN Results vs Benchmarks ==

In order to benchmark the performance of the CRNN model, as well as compare it to other previous efforts, multiple datasets and classification problems were used. All of these datasets are publicly available and can be easily downloaded by any user for testing.

- '''Movie Reviews:''' a sentiment analysis dataset, with two classes (positive and negative).

- '''Yelp:''' a sentiment analysis dataset, with five classes. For this test, a subset of 120,000 reviews was randomly chosen from each class for a total of 600,000 reviews.

- '''AG's News:''' a news categorization dataset, using only the 4 largest classes from the dataset. There are 30000 training samples and 1900 test samples.

- '''20 Newsgroups:''' a news categorization dataset, again using only 4 large classes (comp, politics, rec, and religion) from the dataset.

- '''Sogou News:''' a Chinese news categorization dataset, using 10 major classes as a multi-class classification and include 6500 samples randomly from each class.

- '''Yahoo! Answers:''' a topic classification dataset, with 10 classes and each class contains 140000 training samples and 5000 testing samples.

For the English language datasets, the initial word representations were created using the publicly available ''word2vec'' [https://code.google.com/p/word2vec/] from Google news. For the Chinese language dataset, ''jieba'' [https://github.com/fxsjy/jieba] was used to segment sentences, and then 50-dimensional word vectors were trained on Chinese ''wikipedia'' using ''word2vec''.

A number of other models are run against the same data after preprocessing. Some of these models include:

- '''Self-attentive LSTM:''' an LSTM model with multi-hop attention for sentence embedding.

- '''RCNN:''' the RCNN's recurrent structure allows for increased depth of capture for contextual information. Less noise is introduced on account of the model's holistic structure (compared to local features).

The following results are obtained:

[[File:table of results.png|550px|center]]

The bold results represent the best performing model for a given dataset. These results show that the CRNN model is the best model for 4 of the 6 datasets, with the Self-attentive LSTM beating the CRNN by 0.03 and 0.12 on the news categorization problems. Considering that the CRNN model has better performance than the Self-attentive LSTM on the other 4 datasets, this suggests that the CRNN model is a better performer overall in the conditions of this benchmark.

It should be noted that including the neural tensor layer in the CRNN model leads to a significant performance boost compared to the CRNN models without it. The performance boost can be attributed to the fact that the neural tensor layer captures the surrounding contextual information for each word, and brings this information between the forward and backward RNN in a direct method. As seen in the table, this leads to a better classification accuracy across all datasets.

Another important result was that the CRNN model filter size impacted performance only in the sentiment analysis datasets, as seen in the following:

[[File:filter_effects.png|550px|center]]

== CRNN Model Architecture ==

The CRNN model is a combination of RNN and CNN. It uses CNN to compute the importance of each word in the text and utilizes a neural tensor layer to fuse forward and backward hidden states of bi-directional RNN.

The input of the network is a text, which is a sequence of words. The output of the network is the text representation that is subsequently used as input of a fully-connected layer to obtain the class prediction.

'''RNN Pipeline:'''

The goal of the RNN pipeline is to input each word in a text, and retrieve the contextual information surrounding the word and compute the contextual representation of the word itself. This is accomplished by the use of a bi-directional RNN, such that a Neural Tensor Layer (NTL) can combine the results of the RNN to obtain the final output. RNNs are well-suited to NLP tasks because of their ability to sequentially process data such as ordered text.

A RNN is similar to a feed-forward neural network, but it relies on the use of hidden states. Hidden states are layers in the neural net that produce two outputs: <math> \hat{y}_{t} </math> and <math> h_t </math>. For a time step <math> t </math>, <math> h_t </math> is fed back into the layer to compute <math> \hat{y}_{t+1} </math> and <math> h_{t+1} </math>.

The pipeline will actually use a variant of RNN called GRU, short for Gated Recurrent Units. This is done to address the vanishing gradient problem which causes the network to struggle to memorize words that came earlier in the sequence. Traditional RNNs are only able to remember the most recent words in a sequence, which may be problematic since words that came at the beginning of the sequence that is important to the classification problem may be forgotten. A GRU attempts to solve this by controlling the flow of information through the network using update and reset gates.

Let <math>h_{t-1} \in \mathbb{R}^m, x_t \in \mathbb{R}^d </math> be the inputs, and let <math>\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h \in \mathbb{R}^{m \times d}, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h \in \mathbb{R}^{m \times m}</math> be trainable weight matrices. Then the following equations describe the update and reset gates:

<math>
z_t = \sigma(\mathbf{W}_zx_t + \mathbf{U}_zh_{t-1}) \text{update gate} \\
r_t = \sigma(\mathbf{W}_rx_t + \mathbf{U}_rh_{t-1}) \text{reset gate} \\
\tilde{h}_t = \text{tanh}(\mathbf{W}_hx_t + r_t \circ \mathbf{U}_hh_{t-1}) \text{new memory} \\
h_t = (1-z_t)\circ \tilde{h}_t + z_t\circ h_{t-1}
</math>

Note that <math> \sigma, \text{tanh}, \circ </math> are all element-wise functions. The above equations do the following:

<ol>
<li> <math>h_{t-1}</math> carries information from the previous iteration and <math>x_t</math> is the current input </li>
<li> the update gate <math>z_t</math> controls how much past information should be forwarded to the next hidden state </li>
<li> the rest gate <math>r_t</math> controls how much past information is forgotten or reset </li>
<li> new memory <math>\tilde{h}_t</math> contains the relevant past memory as instructed by <math>r_t</math> and current information from the input <math>x_t</math> </li>
<li> then <math>z_t</math> is used to control what is passed on from <math>h_{t-1}</math> and <math>(1-z_t)</math> controls the new memory that is passed on
</ol>

We treat <math>h_0</math> and <math> h_{n+1} </math> as zero vectors in the method. Thus, each <math>h_t</math> can be computed as above to yield results for the bi-directional RNN. The resulting hidden states <math>\overrightarrow{h_t}</math> and <math>\overleftarrow{h_t}</math> contain contextual information around the <math> t</math>-th word in forward and backward directions respectively. Contrary to convention, instead of concatenating these two vectors, it is argued that the word's contextual representation is more precise when the context information from different directions is collected and fused using a neural tensor layer as it permits greater interactions among each element of hidden states. Using these two vectors as input to the neural tensor layer, <math>V^i </math>, we compute a new representation that aggregates meanings from the forward and backward hidden states more accurately as follows:

<math>
[\hat{h_t}]_i = tanh(\overrightarrow{h_t}V^i\overleftarrow{h_t} + b_i)
</math>

Where <math>V^i \in \mathbb{R}^{m \times m} </math> is the learned tensor layer, and <math> b_i \in \mathbb{R} </math> is the bias.We repeat this <math> m </math> times with different <math>V^i </math> matrices and <math> b_i </math> vectors. Through the neural tensor layer, each element in <math> [\hat{h_t}]_i </math> can be viewed as a different type of intersection between the forward and backward hidden states. In the model, <math> [\hat{h_t}]_i </math> will have the same size as the forward and backward hidden states. We then concatenate the three hidden states vectors to form a new vector that summarizes the context information :
<math>
\overleftrightarrow{h_t} = [\overrightarrow{h_t}^T,\overleftarrow{h_t}^T,\hat{h_t}]^T
</math>

We calculate this vector for every word in the text and then stack them all into matrix <math> H </math> with shape <math>n</math>-by-<math>3m</math>.

<math>
H = [\overleftrightarrow{h_1};...\overleftrightarrow{h_n}]
</math>

This <math>H</math> matrix is then forwarded as the results from the Recurrent Neural Network.

'''CNN Pipeline:'''

The goal of the CNN pipeline is to learn the relative importance of words in an input sequence based on different aspects. The process of this CNN pipeline is summarized as the following steps:

<ol>
<li> Given a sequence of words, each word is converted into a word vector using the word2vec algorithm which gives matrix X.
</li>

<li> Word vectors are then convolved through the temporal dimension with filters of various sizes (ie. different K) with learnable weights to capture various numerical K-gram representations. These K-gram representations are stored in matrix C.
</li>

<ul>
<li> The convolution makes this process capture local and position-invariant features. Local means the K words are contiguous. Position-invariant means K contiguous words at any position are detected in this case via convolution.

<li> Temporal dimension example: convolve words from 1 to K, then convolve words 2 to K+1, etc
</li>
</ul>

<li> Since not all K-gram representations are equally meaningful, there is a learnable matrix W which takes the linear combination of K-gram representations to more heavily weigh the more important K-gram representations for the classification task.
</li>

<li> Each linear combination of the K-gram representations gives the relative word importance based on the aspect that the linear combination encodes.
</li>

<li> The relative word importance vs aspect gives rise to an interpretable attention matrix A, where each element says the relative importance of a specific word for a specific aspect.
</li>

</ol>

[[File:Group12_Figure1.png |center]]

<div align="center">Figure 1: The architecture of CRNN.</div>

== Merging RNN & CNN Pipeline Outputs ==

The results from both the RNN and CNN pipeline can be merged by simply multiplying the output matrices. That is, we compute <math>S=A^TH</math> which has shape <math>z \times 3m</math> and is essentially a linear combination of the hidden states. The concatenated rows of S results in a vector in <math>\mathbb{R}^{3zm}</math> and can be passed to a fully connected Softmax layer to output a vector of probabilities for our classification task.

To train the model, we make the following decisions:
<ul>
<li> Use cross-entropy loss as the loss function (A cross-entropy loss function usually takes in two distributions, a true distribution p and an estimated distribution q, and measures the average number of bits need to identify an event. This calculation is independent of the kind of layers used in the network as well as the kind of activation being implemented.) </li>
<li> Perform dropout on random columns in matrix C in the CNN pipeline </li>
<li> Perform L2 regularization on all parameters </li>
<li> Use stochastic gradient descent with a learning rate of 0.001 </li>
</ul>

== Interpreting Learned CRNN Weights ==

Recall that attention matrix A essentially stores the relative importance of every word in the input sequence for every aspect chosen. Naturally, this means that A is an n-by-z matrix, with n being the number of words in the input sequence and z being the number of aspects considered in the classification task.

Furthermore, for any specific aspect, words with higher attention values are more important relative to other words in the same input sequence. likewise, for any specific word, aspects with higher attention values prioritize the specific word more than other aspects.

For example, in this paper, a sentence is sampled from the Movie Reviews dataset, and the transpose of attention matrix A is visualized. Each word represents an element in matrix A, the intensity of red represents the magnitude of an attention value in A, and each sentence is the relative importance of each word for a specific context. In the first row, the words are weighted in terms of a positive aspect, in the last row, the words are weighted in terms of a negative aspect, and in the middle row, the words are weighted in terms of a positive and negative aspect. Notice how the relative importance of words is a function of the aspect.

[[File:Interpretation example.png|800px|center]]

From the above sample, it is interesting that the word "but" is viewed as a negative aspect. From a linguistic perspective, the semantic of "but" is incredibly difficult to capture because of the degree of contextual information it needs. In this case, "but" is in the middle of a transition from a negative to a positive so the first row should also have given attention to that word. Also, it seems that the model has learned to give very high attention to the two words directly adjacent to the word of high attention: "is" and "and" beside "powerful", and "an" and "cast" beside "unwieldy".

== Conclusion & Summary ==

This paper proposed a new architecture, the Convolutional Recurrent Neural Network, for text classification. The Convolutional Neural Network is used to learn the relative importance of each word from their different aspects and stores it this information into a weight matrix. The Recurrent Neural Network learns each word's contextual representation through the combination of the forward and backward context information that is fused using a neural tensor layer and is stored as a matrix. These two matrices are then combined to get the text representation used for classification. Although the specifics of the performed tests are lacking, the experiment's results indicate that the proposed method performed well in comparison to most previous methods. In addition to performing well, the proposed method also provides insight into which words contribute greatly to the classification decision as the learned matrix from the Convolutional Neural Network stores the relative importance of each word. This information can then be used in other applications or analyses. In the future, one can explore the features extracted from the model and use them to potentially learn new methods such as model space. [5]

== Critiques ==

In the '''Method''' section of the paper, some explanations used the same notation for multiple different elements of the model. This made the paper harder to follow and understand since they were referring to different elements by identical notation. Additionally, the decision to use sigmoid and hyperbolic tangent functions as nonlinearities for representation learning, is not supported with evidence that these are optimal.

In '''Comparison of Methods''', the authors discuss the range of hyperparameter settings that they search through. While some of the hyperparameters have a large range of search values, three parameters are fixed without much explanation as to why for all experiments, size of the hidden state of GRU, number of layers, and dropout. These parameters have a lot to do with the complexity of the model and this paper could be improved by providing relevant reasoning behind these values, or by providing additional experimental results over different values of these parameters.

In the '''Results''' section of the paper, they tried to show that the classification results from the CRNN model can be better interpreted than other models. In these explanations, the details were lacking and the authors did not adequately demonstrate how their model is better than others.

Finally, in the '''Results''' section again, the paper compares the CRNN model to several models which they did not implement and reproduce results with. This can be seen in the chart of results above, where several models do not have entries in the table for all six datasets. Since the authors used a subset of the datasets, these other models which were not reproduced could have different accuracy scores if they had been tested on the same data as the CRNN model. This difference in training and testing data is not mentioned in the paper, and the conclusion that the CRNN model is better in all cases may not be valid.

- Could this be applied to hieroglyphs to decipher/better understand them?

It would be interesting to see how the attention matrix is being constructed and how attention values are being determined in each matrix. For instance, does every different subject have its own attention matrix? If so, how will the situation be handled when the same attention matrix is used in different settings?

-This is an interesting topic. I think it will be better to show more results by using this method. Maybe it will be better to put the result part after the architecture part? Writing a motivation will be better since it will catch readers' "eyes". I think it will be interesting to ask: whether can we apply this to ancient Chinese poetry? Since there are lots of types of ancient Chinese poetry, doing a classification for them will be interesting.

This is an interesting method, I would be curious to see if this can be combined or compared with Quasi-Recurrent Neural Networks (https://arxiv.org/abs/1611.01576). In my experience, QRNNs perform similarly to LSTMs while running significantly faster using convolutions with a special temporal pooling. This seems compatible with the neural tensor layer proposed in this paper, which may be combined to yield stronger performance with faster runtimes.

-The paper shows the CRNN model not performing the best with Ag's news and 20newsgroups. It would be interesting to investigate this in detail and see the difference in the way the data is handled in the model compared to the best performing model(self-attentive LSTM in both datasets).

-From experiments, LSTM outperforms CRNN in some cases. It would be interesting to compare CNN+LSTM and CRNN's performance. Another application for CRNN might be classifying spoken language.

- From the Interpreting Learned CRNN Weights part, the samples are labeled as positive and negative, and their words all have opposite emotional polarities. It can be observed that regardless of whether the polarity of the example is positive or negative, the keyword can be extracted by this method, reflecting that it can capture multiple semantically meaningful components. At the same time it will be very interesting to see if this method is applicable to other specific categories.

- The authors of this paper provide 2 examples of what topic classification is, but do not provide any explicit examples of "polysemic words whose meanings are context-sensitive", one of their main critiques of current methods. This is an opportunity to promote the usefulness of their method and engage and inform the reader, simply by listing examples of these words.

- In another [aclweb.org/anthology/W99-0908.pdf paper] written by Andrew McCallum and Kamal Nigam, they introduce a different method of text classfication. Namely, instead of ___ , they instead utilized bootstrapping with keywords, expectation-maximization algorithm, and shrinkage.

== References ==
----

[1] Grimes, Seth. “Unstructured Data and the 80 Percent Rule.” Breakthrough Analysis, 1 Aug. 2008, breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.

[2] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modeling sentences,”
arXiv preprint arXiv:1404.2188, 2014.

[3] K. Cho, B. V. Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning
phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.

[4] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Proceedings
of AAAI, 2015, pp. 2267–2273.

[5] H. Chen, P. Tio, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE
Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 124–136, 2014.

Surround Vehicle Motion Prediction

2020-11-30T00:53:26Z

D287zhan: /* Critiques */

DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability

== Previous Work ==
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].

== Motivation ==
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.

=== Motion predictor ===
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading and speed.
==== Encoder and decoder ====
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit.
==== Sequence length ====
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\
&||\mu(k+1|t) - \mu(k|t)|| \leq S
\end{split}
\end{equation*}
Determine the position and speed boundary based on the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with the results from three base algorithms, a path-following model with
constant velocity, a path-following model with traffic flow and a CTRV model.

We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>,
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}
\end{split}
\end{equation*}

The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean,
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers'
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better, by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle, even when the target vehicle was not following the intersection guideline.

==== Statistical analysis of motion planning application results ====
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance, and had little influence on determining
the behaviour of the subject vehicle.

In order to compare the similarities between the results form the proposed algorithm and human driving decisions,
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed
model is efficient and safe.

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm.

== Future works ==
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.

This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.

Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.

There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.

It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.

The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].

[6]Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.

Surround Vehicle Motion Prediction

2020-11-30T00:53:19Z

D287zhan: /* Critiques */

DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability

== Previous Work ==
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].

== Motivation ==
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.

=== Motion predictor ===
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading and speed.
==== Encoder and decoder ====
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit.
==== Sequence length ====
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\
&||\mu(k+1|t) - \mu(k|t)|| \leq S
\end{split}
\end{equation*}
Determine the position and speed boundary based on the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with the results from three base algorithms, a path-following model with
constant velocity, a path-following model with traffic flow and a CTRV model.

We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>,
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}
\end{split}
\end{equation*}

The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean,
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers'
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better, by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle, even when the target vehicle was not following the intersection guideline.

==== Statistical analysis of motion planning application results ====
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance, and had little influence on determining
the behaviour of the subject vehicle.

In order to compare the similarities between the results form the proposed algorithm and human driving decisions,
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed
model is efficient and safe.

== Conclusion ==
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm.

== Future works ==
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.

This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.

Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.

There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.

It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.

The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies,such as Lyft and Uber, have started testing their own commercial autonomous vehicles.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].

[6]Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.

Efficient kNN Classification with Different Numbers of Nearest Neighbors

2020-11-30T00:37:41Z

D287zhan: /* Critiques */

== Presented by ==
Cooper Brooke, Daniel Fagan, Maya Perelman

== Introduction ==
Unlike popular parametric approaches, which first utilizes training observations to learn a model and subsequently predict test samples with this model, the non-parametric k-nearest neighbors (kNNs) method classifies observations with a majority rules approach. kNN classifies test samples by assigning them the label of their k closest training observation (neighbours). This method has become very popular due to its significant performance and easy implementation.

There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples. The second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.

== Previous Work ==

Previous work on finding an optimal fixed k value for all test samples is well-studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al.[2] explored involved selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] had the solution of selecting a robust k using the consensus of multiple rounds of kNNs. These fixed k methods are valuable however are impractical for data mining and machine learning applications.

Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahugara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross validation while Sahugara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches such as those taken by Zheng et al. and Góra and Wojna also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.

== Motivation ==

Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seek to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.

A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.

== Approach ==

=== kTree Classification ===

The proposed kTree method is illustrated by the following flow chart:

[[File:Approach_Figure_1.png | center | 800x800px]]

==== Reconstruction ====

The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}</math> represents the training set can be written as:

$$\begin{aligned}
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2
\end{aligned}$$

In addition, an <math>l_1</math> regularization term multiplied by a tuning parameter, <math>\rho_1</math>, is added to ensure that sparse results are generated as the objective is to minimize the number of training samples that will eventually be depended on by the test samples.

The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. After some transformations, this second regularization term that has tuning parameter <math>\rho_2</math> is:

$$\begin{aligned}
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})
\end{aligned}$$

where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features.

This gives a final objective function of:

$$\begin{aligned}
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}|| + \rho_2R(\textbf{W})
\end{aligned}$$

Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.

==== Calculate ''k'' for training set ====

Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training sample so if a value is 0, it can be concluded that the jth training sample has no effect on the ith training sample which means that it should not be used in the prediction of the ith training sample. Consequently, all non-zero values in the <math>w_{.j}</math> vector would be useful in predicting the ith training sample which gives the result that the number of these non-zero elements for each sample is equal to the optimal ''k'' value for each sample.

For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:

[[File:Approach_Figure_2.png | center | 300x300px]]

The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 is non-zero.

==== Train a Decision Tree using ''k'' as the label ====

As opposed to a normal decision tree where the target data is the labels themselves, the target data when using the kTree method are the optimal ''k'' values for each sample that were solved for in the previous step so this decision tree has the following form:

[[File:Approach_Figure_3.png | center | 300x300px]]

==== Making Predictions for Test Data ====

The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbours across '''all''' of the training data.

=== k*Tree Classification ===

The proposed k*Tree method is illustrated by the following flow chart:

[[File:Approach_Figure_4.png | center | 800x800px]]

Clearly, this is a very similar approach to the kTree as the k*Tree method attempts to sacrifice very little in predictive power in return for a substantial decrease in complexity when actually implementing the traditional kNN on the testing data once the optimal ''k'' values have been found.

While all steps previous are the exact same, the k*Tree method not only stores the optimal ''k'' value but also the following information:

* The training samples that have the same optimal ''k''
* The ''k'' nearest neighbours of the previously identified training samples
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours

The data stored in each node is summarized in the following figure:

[[File:Approach_Figure_5.png | center | 800x800px]]

In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.

== Experiments ==

In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, and contained a mix of data types varying in size, in dimensionality, in the number of classes, and in imbalanced nature of the data. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:

# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.
# kNN-Based Applicability Domain Approach (AD-kNN) [11]
# kNN Method Based on Sparse Learning (S-kNN) [10]
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]

The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features.

'''A. Experimental Results on Different Sample Sizes'''

The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.

[[File:Table_I_kNN.png | center | 800x800px]]

The following key results are noted:
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-KNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.
* The proposed k*Tree method had the lowest running cost of all methods. However, the k*Tree method was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method.

'''B. Experimental Results on Different Feature Numbers'''

The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score [15] approach was used to rank and select the most information features in the datasets.

[[File:Table_II_kNN.png | center | 800x800px]]

From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-KNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.

== Conclusion ==

This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods achieve efficient classification by designing a training step that reduces the run time of the test stage. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features.

== Critiques ==

*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy.
* The authors addressed that some of the UCI datasets contained imbalance data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.).
*While the authors contrast their ktTee and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compared to Góra and Wojna varied optimal k method.

* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.

* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.

== References ==

[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.

[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.

[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.

[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.

[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.

[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.

[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.

[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.

[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.

[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.

[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.

[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.

[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.

[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.

[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.

Superhuman AI for Multiplayer Poker

2020-11-30T00:25:43Z

D287zhan: /* Discussion and Critiques */

== Presented by ==
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty

== Introduction ==

A superintelligence is a hypothetical agent that possesses intelligence far surpassing that of the brightest and most gifted human minds. In the past two decades, most of the superhuman AI that was built can only beat human players in two-player zero-sum games. They almost dominated most of the board games in these twenty years. The most popular AI in the board games are the chess AI deep blue and the go chess AI Alpha-go. The most common strategy that the AI uses to beat those games is to find the most optimal Nash equilibrium. A Nash equilibrium is a pair of strategies such that either single-player switching to any ''other'' choice of strategy (while the other player's strategy remains unchanged) will result in a lower payout for the switching player. Intuitively this is similar to a locally optimal strategy for the players but is (i) not guaranteed to exist and (ii) may not be the truly optimal strategy. An example of this is the Prisoner's dilemma, where two individuals each have the option to testify against the other or to remain silent. Although the optimal choice is to remain silent, the individuals have an incentive to act in their own self-interest which results in a less than optimal outcome.

More specifically, in the game of poker, we only have AI models that can beat human players in two-player settings. Poker is a great challenge in AI and game theory because it captures the challenges in hidden information so elegantly. This means that developing a superhuman AI in multiplayer poker is the remaining great milestone in this field, because there is no polynomial-time algorithm that can find a Nash equilibrium in two-player non-zero-sum games, and having one would have surprising implications in computational complexity theory.

In this paper, the AI which we call Pluribus is capable of defeating human professional poker players in Texas hold'em poker which is a six-player poker game and is the most commonly played format in the world. The algorithm that is used is not guaranteed to converge to a Nash algorithm outside of two-player zero-sum games. However, it uses a strong strategy that is capable of consistently defeating elite human professionals. This shows that despite not having strong theoretical guarantees on performance, they are capable of applying a wider class of superhuman strategies.

== Nash Equilibrium in Multiplayer Games ==

Many AI has reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. Nash equilibrium has been proven to exist in all finite games and numerous infinite games. However, the challenge is to find the equilibrium. It is the best possible strategy and is unbeatable in two-player zero-sum games since it guarantees to not lose in expectation regardless of what the opponent is doing.

To have a deeper understanding of Nash Equilibria we must first define some basic game theory concepts. The first one being a strategic game, in game theory a strategic game consists of a set of players, for each player a set of actions and for each player preferences (or payoffs) over the set of action profiles (set of combination of actions). With these three elements, we can model a wide variety of situations. Now a Nash Equilibrium is an action profile, with the property that no player can do better by changing their action, given that all other players' actions remain the same. A common illustration of Nash equilibria is the Prisoner's Dilemma. We also have mixed strategies and mixed strategy Nash equilibria. A mixed strategy is when instead of a player choosing an action they apply a probability distribution to their set of actions and pick randomly. Note that with mixed strategies we must look at the expected payoff of the player given the other players' strategies. Therefore a mixed strategy Nash Equilibria involves at least one player playing with a mixed strategy where no player can increase their expected payoff by changing their action, given that all other players' actions remain the same. Then we can define a pure Nash Equilibria to where no one is playing a mixed strategy. We also must be aware that a single game can have multiple pure Nash equilibria and mixed Nash equilibria. Also, Nash Equilibria are purely theoretical and depend on players acting optimally and being rational, this is not always the case with humans and we can act very irrationally. Therefore empirically we will see that games can have very unexpected outcomes and you may be able to get a better payoff if you move away from a strictly theoretical strategy and take advantage of your opponent's irrational behavior.

The insufficiency with current AI systems is that they only try to achieve Nash equilibriums instead of trying to actively detect and exploit weaknesses in opponents. At the Nash equilibrium, there is no incentive for any player to change their initial strategy, so it is a stable state of the system. For example, let's consider the game of Rock-Paper-Scissors, the Nash equilibrium is to randomly pick any option with equal probability. However, we can see that this means the best strategy that the opponent can have will result in a tie. Therefore, in this example, our player cannot win in expectation. Now let's try to combine the Nash equilibrium strategy and opponent exploitation. We can initially use the Nash equilibrium strategy and then change our strategy overtime to exploit the observed weaknesses of our opponent. For example, we switch to always play Rock against our opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy opens up the possibility for our opponent to use our strategy against ourselves. For example, they notice we always play Rock and thus they will now always play Paper.

Trying to approximate a Nash equilibrium is hard in theory, and in games with more than two players, it can only find a handful of possible strategies per player. Currently, existing techniques to find ways to exploit an opponent require way too many samples and are not competitive enough outside of small games. Finding a Nash equilibrium in three or more players is a great challenge. Even we can efficiently compute a Nash equilibrium in games with more than two players, it is still highly questionable if playing the Nash equilibrium strategy is a good choice. Additionally, if each player tries to find their own version of a Nash equilibrium, we could have infinitely many strategies and each player’s version of the equilibrium might not even be a Nash equilibrium.

Consider the Lemonade Stand example from Figure 1 Below. We have 4 players and the goal for each player is to find a spot in the ring that is furthest away from every other player. This way, each lemonade stand can cover as much selling region as possible and generate maximum revenue. In the left circle, we have three different Nash equilibria distinguished by different colors which would benefit everyone. The right circle is an illustration of what would happen if each player decides to calculate their own Nash equilibrium.

[[File:Lemonade_Example.png| 600px |center ]]

<div align="center">Figure 1: Lemonade Stand Example</div>

From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibria, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.

== Theoretical Analysis ==
Pluribus uses forms of abstraction to make computations scalable. To simplify the complexity due to too many decision points, some actions are eliminated from consideration and similar decision points are grouped together and treated as identical. This process is called abstraction. Pluribus uses two kinds of abstraction: Action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes (the exact number of bets it considers varies between 1 and 14 depending on the situation). Information abstraction groups together decision points that reveal similar information. For instance, the player’s cards and revealed board cards. This is only used to reason about situations on future betting rounds, never the current betting round.

Pluribus uses a built-in strategy - “Blueprint strategy”, which it gradually improves by searching in real-time in situations it finds itself in during the course of the game. In the first betting round, pluribus uses the initial blueprint strategy when the number of decision points is small. The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used in imperfect information games AI which is trained by repeatedly playing against copies of itself, without any data of human or prior AI play used as input. For ease of computation of CFR in this context, poker is represented as a game tree. A game tree is a tree structure where each node represents either a player’s decision, a chance event, or a terminal outcome and edges represent actions taken.

[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]

<div align="center">Figure 1: Kuhn Poker (Simpler form of Poker) </div>

At the start of each iteration, MCCFR stimulates a hand of poker randomly (Cards held by a player at a given time) and designates one player as the traverser of the game tree. Once that is completed, the AI reviews the decision made by the traverser at a decision point in the game and investigates whether the decision was profitable. The AI compares its decision with other actions available to the traverser at that point and also with the future hypothetical decisions that would have been made following the other available actions. To evaluate a decision, the Counterfactual Regret factor is used. This is the difference between what the traverser would have expected to receive for choosing an action and actually received on the iteration. Thus regret is a numeric value, where a positive regret indicates you regret your decision, a negative regret indicates you are happy with your decision and zero regret indicates that you are indifferent.

The value of counterfactual regret for a decision is adjusted over the iterations as more scenarios or decision points are encountered. This means at the end of each iteration, the traverser’s strategy is updated so actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations until the average strategy overall iterations converge and the average strategy is the approximated Nash equilibrium. CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in early iterations to reduce the influence of initial bad iterations i.e it assigns a weight of T to regret contributions at iteration T. This causes the influence of the first iteration to decay at a rate of <math>\frac{1}{\sum_{t=1}^Tt} = \frac{2}{T(T+1)}</math>, compared to a rate of <math>\frac{1}{T}</math> in the original CFR algorithm. This leads to the strategy of improving more quickly in practice.

An additional feature of Pluribus is that in the subgames, instead of assuming that all players play according to a single strategy, Pluribus considers that each player may choose between k different strategies specialized to each player when a decision point is reached. This results in the searcher choosing a more balanced strategy. For instance, if a player never bluffs while holding the best possible hand then the opponents would learn that fact and always fold in that scenario. To fold in that scenario is a balanced strategy than to bet.
Therefore, the blueprint strategy is produced offline for the entire game and it is gradually improved while making real-time decisions during the game.

== Experimental Results ==
To test how well Pluribus functions, it was tested against human players in 2 formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who have won more than $1M playing professionally and were provided with cash incentives to play their best. 10,000 hands of poker were played over 12 days with the 5H+1AI format by anonymizing the players by providing each of them with aliases that remained consistent throughout all their games. The aliases helped the players keep track of the tendencies and types of games played by each player over the 10,000 hands played.

The second format included one human player and 5 copies of Pluribus (1H+5AI). There were 2 more professional players who split another 10,000 hands of poker by playing 5000 hands each and followed the same aliasing process as the first format.
The performance was measured using milli big blinds per game, mbb/game, (i.e. the initial amount of money the second player has to put in the pot) which is the standard measure in the AI field. Additionally, AIVAT was used as the variance reduction technique to control for luck in the games, and significance tests were run at a 95% significance level with one-tailed t-tests as a check for Pluribus’s performance in being profitable.

Applying AIVAT the following were the results:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable
|-
! scope="row" | 5H+1AI
| 48 || 25 || 0.028
|-
! scope="row" | 1H+5AI
| 32 || 15 || 0.014
|}
[[File:top.PNG| 950px | x450px |left]]

<div align="center">"Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot."</div>

Optimal play in Pluribus looks different from well-known poker conventions: A standard convention of “limping” in poker (calling the 'big blind' rather than folding or raising) is confirmed to be not optimal by Pluribus since it initially experimented with it but eliminated this from its strategy over its games of self-play. On the other hand, another convention of “donk betting” (starting a round by betting when someone else ended the previous round with a call) that is dismissed by players was adopted by Pluribus much more often than played by humans and is proven to be profitable.

== Discussion and Critiques ==

Pluribus' Blueprint strategy and Abstraction methods effectively reduce the computational power required. Hence it was computed in 8 days and required less than 512 GB of memory, and costs about $144 to produce. This is in sharp contrast to all the other recent superhuman AI milestones for games. This is a great way the researchers have condensed down the problem to fit the current computational powers.

Pluribus definitely shows that we can capture observational data and empirical results to construct a superhuman AI without requiring theoretical guarantees, this can be a baseline for future AI inventions and help in the research of AI. It would be interesting to use Pluribus's way of using a non-theoretical approach in more real-life problems such as autonomous driving or stock market trading.

Extending this idea beyond two-player zero-sum games will have many applications in real life.

The summary for Superhuman AI for Multiplayer Poker is very well written, with a detailed explanation of the concept, steps, and result and with a combination of visual images. However, it seems that the experiment of the study is not well designed. For example, sample selection is not strict and well defined, this could cause selection bias introduced into the result and thus making it not generalizable.

Superhuman AI, while sounding superior, is actually not uncommon. There have been many endeavours on mastering poker such as the Recursive Belief-based Learning (ReBeL) by Facebook Research. They pursued a method of reinforcement learning on a partially observable Markov decision process which was inspired by the recent successes of AlphaZero. For Pluribus to demonstrate how effective it is compared to the state-of-the-art, it should run some experiments against ReBeL.

This is a very interesting topic, and this summary is clear enough for readers to understand. I think this application not only can apply in poker, maybe thinking of more applications in other areas? There are many famous AI that really changing our life. For example, AlphaGo and AlphaStar, which are developed by Google DeepMind, defeated professional gamers. Discussing more this will be interesting.

One of the biggest issues when applying AI to games against humans (when not all information is known, ie, opponents' cards) is the assumption is generally made that the human players are rational players which follow a certain set of "rules" based on the information that they know. This could be an issue with the fact that Pluribus has trained itself by playing itself instead of humans. While the results clearly show that Pluribus has found some kind of 'optimal' method to play, it would be interesting to see if it could actually maximize its profits by learning the trends of its human opponents over time (learning on the fly with information gained each hand while it's playing). In addition to that, the paper may discuss how human action could be changed in the game when they play with Superhuman AI. We can see that playing card games require various strategy and different people can have a different set of actions in the same game and in the same situation.

One interesting software called Piosolver leverages a similar tree-based algorithm presented in the paper to recommend the move that is deemed game theory optimal (GTO). In the poker world, GTO is a play-style that is based on mathematics and is considered a "defensive" strategy. Following the rock, paper, scissors analogy from the paper, a GTO play-style is synonymous with choosing randomly between the three options, whereas an exploitative strategy involves reading a human player's tendencies and adjusting the strategy accordingly. Piosolver is used by many professional poker players to enhance their game and gain intuition on what the best move is in certain situations.

Another way to train the proposed model can be a poker game with two or more AI players. That method was used by AlphaGo to train a better model.

Games with various AI players would be an interesting topic, and through comparing different AI, their shortcomes could be observed and improved. More discussions on this would be of interests.

Similar to Pluribis, another [https://science.sciencemag.org/content/356/6337/508 paper] discussed a different AI program, called DeepStack, which also has defeated professional poker players at a 2-player Texas hold'em variant. However, instead of finding the Nash Equilibria, DeepStack uses recursive reasoning, decomposition, and a form of intuition that is automatically learned from self-play.

== Conclusion ==

As Pluribus’s strategy was not developed with any human data and was trained by self-play only, it is an unbiased and different perspective on how optimal play can be attained.
Developing a superhuman AI for multiplayer poker was a widely recognized
a milestone in this area and the major remaining milestone in computer poker.
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.

== References ==

Noam Brown and Tuomas Sandholm (July 11, 2019). Superhuman AI for multiplayer poker. Science 365.

Osborne, Martin J.; Rubinstein, Ariel (July 12, 1994). A Course in Game Theory. Cambridge, MA: MIT. p. 14.

Justin Sermeno. (November 17, 2020). Vanilla Counterfactual Regret Minimization for Engineers. https://justinsermeno.com/posts/cfr/#:~:text=Counterfactual%20regret%20minimization%20%28CFR%29%20is%20an%20algorithm%20that,decision.%20It%20can%20be%20positive%2C%20negative%2C%20or%20zero

Brown, N., Bakhtin, A., Lerer, A., & Gong, Q. (2020). Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33.

Superhuman AI for Multiplayer Poker

2020-11-30T00:20:10Z

D287zhan: /* Discussion and Critiques */

== Presented by ==
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty

== Introduction ==

A superintelligence is a hypothetical agent that possesses intelligence far surpassing that of the brightest and most gifted human minds. In the past two decades, most of the superhuman AI that was built can only beat human players in two-player zero-sum games. They almost dominated most of the board games in these twenty years. The most popular AI in the board games are the chess AI deep blue and the go chess AI Alpha-go. The most common strategy that the AI uses to beat those games is to find the most optimal Nash equilibrium. A Nash equilibrium is a pair of strategies such that either single-player switching to any ''other'' choice of strategy (while the other player's strategy remains unchanged) will result in a lower payout for the switching player. Intuitively this is similar to a locally optimal strategy for the players but is (i) not guaranteed to exist and (ii) may not be the truly optimal strategy. An example of this is the Prisoner's dilemma, where two individuals each have the option to testify against the other or to remain silent. Although the optimal choice is to remain silent, the individuals have an incentive to act in their own self-interest which results in a less than optimal outcome.

More specifically, in the game of poker, we only have AI models that can beat human players in two-player settings. Poker is a great challenge in AI and game theory because it captures the challenges in hidden information so elegantly. This means that developing a superhuman AI in multiplayer poker is the remaining great milestone in this field, because there is no polynomial-time algorithm that can find a Nash equilibrium in two-player non-zero-sum games, and having one would have surprising implications in computational complexity theory.

In this paper, the AI which we call Pluribus is capable of defeating human professional poker players in Texas hold'em poker which is a six-player poker game and is the most commonly played format in the world. The algorithm that is used is not guaranteed to converge to a Nash algorithm outside of two-player zero-sum games. However, it uses a strong strategy that is capable of consistently defeating elite human professionals. This shows that despite not having strong theoretical guarantees on performance, they are capable of applying a wider class of superhuman strategies.

== Nash Equilibrium in Multiplayer Games ==

Many AI has reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. Nash equilibrium has been proven to exist in all finite games and numerous infinite games. However, the challenge is to find the equilibrium. It is the best possible strategy and is unbeatable in two-player zero-sum games since it guarantees to not lose in expectation regardless of what the opponent is doing.

To have a deeper understanding of Nash Equilibria we must first define some basic game theory concepts. The first one being a strategic game, in game theory a strategic game consists of a set of players, for each player a set of actions and for each player preferences (or payoffs) over the set of action profiles (set of combination of actions). With these three elements, we can model a wide variety of situations. Now a Nash Equilibrium is an action profile, with the property that no player can do better by changing their action, given that all other players' actions remain the same. A common illustration of Nash equilibria is the Prisoner's Dilemma. We also have mixed strategies and mixed strategy Nash equilibria. A mixed strategy is when instead of a player choosing an action they apply a probability distribution to their set of actions and pick randomly. Note that with mixed strategies we must look at the expected payoff of the player given the other players' strategies. Therefore a mixed strategy Nash Equilibria involves at least one player playing with a mixed strategy where no player can increase their expected payoff by changing their action, given that all other players' actions remain the same. Then we can define a pure Nash Equilibria to where no one is playing a mixed strategy. We also must be aware that a single game can have multiple pure Nash equilibria and mixed Nash equilibria. Also, Nash Equilibria are purely theoretical and depend on players acting optimally and being rational, this is not always the case with humans and we can act very irrationally. Therefore empirically we will see that games can have very unexpected outcomes and you may be able to get a better payoff if you move away from a strictly theoretical strategy and take advantage of your opponent's irrational behavior.

The insufficiency with current AI systems is that they only try to achieve Nash equilibriums instead of trying to actively detect and exploit weaknesses in opponents. At the Nash equilibrium, there is no incentive for any player to change their initial strategy, so it is a stable state of the system. For example, let's consider the game of Rock-Paper-Scissors, the Nash equilibrium is to randomly pick any option with equal probability. However, we can see that this means the best strategy that the opponent can have will result in a tie. Therefore, in this example, our player cannot win in expectation. Now let's try to combine the Nash equilibrium strategy and opponent exploitation. We can initially use the Nash equilibrium strategy and then change our strategy overtime to exploit the observed weaknesses of our opponent. For example, we switch to always play Rock against our opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy opens up the possibility for our opponent to use our strategy against ourselves. For example, they notice we always play Rock and thus they will now always play Paper.

Trying to approximate a Nash equilibrium is hard in theory, and in games with more than two players, it can only find a handful of possible strategies per player. Currently, existing techniques to find ways to exploit an opponent require way too many samples and are not competitive enough outside of small games. Finding a Nash equilibrium in three or more players is a great challenge. Even we can efficiently compute a Nash equilibrium in games with more than two players, it is still highly questionable if playing the Nash equilibrium strategy is a good choice. Additionally, if each player tries to find their own version of a Nash equilibrium, we could have infinitely many strategies and each player’s version of the equilibrium might not even be a Nash equilibrium.

Consider the Lemonade Stand example from Figure 1 Below. We have 4 players and the goal for each player is to find a spot in the ring that is furthest away from every other player. This way, each lemonade stand can cover as much selling region as possible and generate maximum revenue. In the left circle, we have three different Nash equilibria distinguished by different colors which would benefit everyone. The right circle is an illustration of what would happen if each player decides to calculate their own Nash equilibrium.

[[File:Lemonade_Example.png| 600px |center ]]

<div align="center">Figure 1: Lemonade Stand Example</div>

From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibria, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.

== Theoretical Analysis ==
Pluribus uses forms of abstraction to make computations scalable. To simplify the complexity due to too many decision points, some actions are eliminated from consideration and similar decision points are grouped together and treated as identical. This process is called abstraction. Pluribus uses two kinds of abstraction: Action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes (the exact number of bets it considers varies between 1 and 14 depending on the situation). Information abstraction groups together decision points that reveal similar information. For instance, the player’s cards and revealed board cards. This is only used to reason about situations on future betting rounds, never the current betting round.

Pluribus uses a built-in strategy - “Blueprint strategy”, which it gradually improves by searching in real-time in situations it finds itself in during the course of the game. In the first betting round, pluribus uses the initial blueprint strategy when the number of decision points is small. The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used in imperfect information games AI which is trained by repeatedly playing against copies of itself, without any data of human or prior AI play used as input. For ease of computation of CFR in this context, poker is represented as a game tree. A game tree is a tree structure where each node represents either a player’s decision, a chance event, or a terminal outcome and edges represent actions taken.

[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]

<div align="center">Figure 1: Kuhn Poker (Simpler form of Poker) </div>

At the start of each iteration, MCCFR stimulates a hand of poker randomly (Cards held by a player at a given time) and designates one player as the traverser of the game tree. Once that is completed, the AI reviews the decision made by the traverser at a decision point in the game and investigates whether the decision was profitable. The AI compares its decision with other actions available to the traverser at that point and also with the future hypothetical decisions that would have been made following the other available actions. To evaluate a decision, the Counterfactual Regret factor is used. This is the difference between what the traverser would have expected to receive for choosing an action and actually received on the iteration. Thus regret is a numeric value, where a positive regret indicates you regret your decision, a negative regret indicates you are happy with your decision and zero regret indicates that you are indifferent.

The value of counterfactual regret for a decision is adjusted over the iterations as more scenarios or decision points are encountered. This means at the end of each iteration, the traverser’s strategy is updated so actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations until the average strategy overall iterations converge and the average strategy is the approximated Nash equilibrium. CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in early iterations to reduce the influence of initial bad iterations i.e it assigns a weight of T to regret contributions at iteration T. This causes the influence of the first iteration to decay at a rate of <math>\frac{1}{\sum_{t=1}^Tt} = \frac{2}{T(T+1)}</math>, compared to a rate of <math>\frac{1}{T}</math> in the original CFR algorithm. This leads to the strategy of improving more quickly in practice.

An additional feature of Pluribus is that in the subgames, instead of assuming that all players play according to a single strategy, Pluribus considers that each player may choose between k different strategies specialized to each player when a decision point is reached. This results in the searcher choosing a more balanced strategy. For instance, if a player never bluffs while holding the best possible hand then the opponents would learn that fact and always fold in that scenario. To fold in that scenario is a balanced strategy than to bet.
Therefore, the blueprint strategy is produced offline for the entire game and it is gradually improved while making real-time decisions during the game.

== Experimental Results ==
To test how well Pluribus functions, it was tested against human players in 2 formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who have won more than $1M playing professionally and were provided with cash incentives to play their best. 10,000 hands of poker were played over 12 days with the 5H+1AI format by anonymizing the players by providing each of them with aliases that remained consistent throughout all their games. The aliases helped the players keep track of the tendencies and types of games played by each player over the 10,000 hands played.

The second format included one human player and 5 copies of Pluribus (1H+5AI). There were 2 more professional players who split another 10,000 hands of poker by playing 5000 hands each and followed the same aliasing process as the first format.
The performance was measured using milli big blinds per game, mbb/game, (i.e. the initial amount of money the second player has to put in the pot) which is the standard measure in the AI field. Additionally, AIVAT was used as the variance reduction technique to control for luck in the games, and significance tests were run at a 95% significance level with one-tailed t-tests as a check for Pluribus’s performance in being profitable.

Applying AIVAT the following were the results:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable
|-
! scope="row" | 5H+1AI
| 48 || 25 || 0.028
|-
! scope="row" | 1H+5AI
| 32 || 15 || 0.014
|}
[[File:top.PNG| 950px | x450px |left]]

<div align="center">"Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot."</div>

Optimal play in Pluribus looks different from well-known poker conventions: A standard convention of “limping” in poker (calling the 'big blind' rather than folding or raising) is confirmed to be not optimal by Pluribus since it initially experimented with it but eliminated this from its strategy over its games of self-play. On the other hand, another convention of “donk betting” (starting a round by betting when someone else ended the previous round with a call) that is dismissed by players was adopted by Pluribus much more often than played by humans and is proven to be profitable.

== Discussion and Critiques ==

Pluribus' Blueprint strategy and Abstraction methods effectively reduce the computational power required. Hence it was computed in 8 days and required less than 512 GB of memory, and costs about $144 to produce. This is in sharp contrast to all the other recent superhuman AI milestones for games. This is a great way the researchers have condensed down the problem to fit the current computational powers.

Pluribus definitely shows that we can capture observational data and empirical results to construct a superhuman AI without requiring theoretical guarantees, this can be a baseline for future AI inventions and help in the research of AI. It would be interesting to use Pluribus's way of using a non-theoretical approach in more real-life problems such as autonomous driving or stock market trading.

Extending this idea beyond two-player zero-sum games will have many applications in real life.

The summary for Superhuman AI for Multiplayer Poker is very well written, with a detailed explanation of the concept, steps, and result and with a combination of visual images. However, it seems that the experiment of the study is not well designed. For example, sample selection is not strict and well defined, this could cause selection bias introduced into the result and thus making it not generalizable.

Superhuman AI, while sounding superior, is actually not uncommon. There have been many endeavours on mastering poker such as the Recursive Belief-based Learning (ReBeL) by Facebook Research. They pursued a method of reinforcement learning on a partially observable Markov decision process which was inspired by the recent successes of AlphaZero. For Pluribus to demonstrate how effective it is compared to the state-of-the-art, it should run some experiments against ReBeL.

This is a very interesting topic, and this summary is clear enough for readers to understand. I think this application not only can apply in poker, maybe thinking of more applications in other areas? There are many famous AI that really changing our life. For example, AlphaGo and AlphaStar, which are developed by Google DeepMind, defeated professional gamers. Discussing more this will be interesting.

One of the biggest issues when applying AI to games against humans (when not all information is known, ie, opponents' cards) is the assumption is generally made that the human players are rational players which follow a certain set of "rules" based on the information that they know. This could be an issue with the fact that Pluribus has trained itself by playing itself instead of humans. While the results clearly show that Pluribus has found some kind of 'optimal' method to play, it would be interesting to see if it could actually maximize its profits by learning the trends of its human opponents over time (learning on the fly with information gained each hand while it's playing). In addition to that, the paper may discuss how human action could be changed in the game when they play with Superhuman AI. We can see that playing card games require various strategy and different people can have a different set of actions in the same game and in the same situation.

One interesting software called Piosolver leverages a similar tree-based algorithm presented in the paper to recommend the move that is deemed game theory optimal (GTO). In the poker world, GTO is a play-style that is based on mathematics and is considered a "defensive" strategy. Following the rock, paper, scissors analogy from the paper, a GTO play-style is synonymous with choosing randomly between the three options, whereas an exploitative strategy involves reading a human player's tendencies and adjusting the strategy accordingly. Piosolver is used by many professional poker players to enhance their game and gain intuition on what the best move is in certain situations.

Another way to train the proposed model can be a poker game with two or more AI players. That method was used by AlphaGo to train a better model.

Games with various AI players would be an interesting topic, and through comparing different AI, their shortcomes could be observed and improved. More discussions on this would be of interests.

Similar to Pluribis, recently, there was another AI

== Conclusion ==

As Pluribus’s strategy was not developed with any human data and was trained by self-play only, it is an unbiased and different perspective on how optimal play can be attained.
Developing a superhuman AI for multiplayer poker was a widely recognized
a milestone in this area and the major remaining milestone in computer poker.
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.

== References ==

Noam Brown and Tuomas Sandholm (July 11, 2019). Superhuman AI for multiplayer poker. Science 365.

Osborne, Martin J.; Rubinstein, Ariel (July 12, 1994). A Course in Game Theory. Cambridge, MA: MIT. p. 14.

Justin Sermeno. (November 17, 2020). Vanilla Counterfactual Regret Minimization for Engineers. https://justinsermeno.com/posts/cfr/#:~:text=Counterfactual%20regret%20minimization%20%28CFR%29%20is%20an%20algorithm%20that,decision.%20It%20can%20be%20positive%2C%20negative%2C%20or%20zero

Brown, N., Bakhtin, A., Lerer, A., & Gong, Q. (2020). Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33.

Superhuman AI for Multiplayer Poker

2020-11-30T00:19:13Z

D287zhan: /* Discussion and Critiques */

== Presented by ==
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty

== Introduction ==

A superintelligence is a hypothetical agent that possesses intelligence far surpassing that of the brightest and most gifted human minds. In the past two decades, most of the superhuman AI that was built can only beat human players in two-player zero-sum games. They almost dominated most of the board games in these twenty years. The most popular AI in the board games are the chess AI deep blue and the go chess AI Alpha-go. The most common strategy that the AI uses to beat those games is to find the most optimal Nash equilibrium. A Nash equilibrium is a pair of strategies such that either single-player switching to any ''other'' choice of strategy (while the other player's strategy remains unchanged) will result in a lower payout for the switching player. Intuitively this is similar to a locally optimal strategy for the players but is (i) not guaranteed to exist and (ii) may not be the truly optimal strategy. An example of this is the Prisoner's dilemma, where two individuals each have the option to testify against the other or to remain silent. Although the optimal choice is to remain silent, the individuals have an incentive to act in their own self-interest which results in a less than optimal outcome.

More specifically, in the game of poker, we only have AI models that can beat human players in two-player settings. Poker is a great challenge in AI and game theory because it captures the challenges in hidden information so elegantly. This means that developing a superhuman AI in multiplayer poker is the remaining great milestone in this field, because there is no polynomial-time algorithm that can find a Nash equilibrium in two-player non-zero-sum games, and having one would have surprising implications in computational complexity theory.

In this paper, the AI which we call Pluribus is capable of defeating human professional poker players in Texas hold'em poker which is a six-player poker game and is the most commonly played format in the world. The algorithm that is used is not guaranteed to converge to a Nash algorithm outside of two-player zero-sum games. However, it uses a strong strategy that is capable of consistently defeating elite human professionals. This shows that despite not having strong theoretical guarantees on performance, they are capable of applying a wider class of superhuman strategies.

== Nash Equilibrium in Multiplayer Games ==

Many AI has reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. Nash equilibrium has been proven to exist in all finite games and numerous infinite games. However, the challenge is to find the equilibrium. It is the best possible strategy and is unbeatable in two-player zero-sum games since it guarantees to not lose in expectation regardless of what the opponent is doing.

To have a deeper understanding of Nash Equilibria we must first define some basic game theory concepts. The first one being a strategic game, in game theory a strategic game consists of a set of players, for each player a set of actions and for each player preferences (or payoffs) over the set of action profiles (set of combination of actions). With these three elements, we can model a wide variety of situations. Now a Nash Equilibrium is an action profile, with the property that no player can do better by changing their action, given that all other players' actions remain the same. A common illustration of Nash equilibria is the Prisoner's Dilemma. We also have mixed strategies and mixed strategy Nash equilibria. A mixed strategy is when instead of a player choosing an action they apply a probability distribution to their set of actions and pick randomly. Note that with mixed strategies we must look at the expected payoff of the player given the other players' strategies. Therefore a mixed strategy Nash Equilibria involves at least one player playing with a mixed strategy where no player can increase their expected payoff by changing their action, given that all other players' actions remain the same. Then we can define a pure Nash Equilibria to where no one is playing a mixed strategy. We also must be aware that a single game can have multiple pure Nash equilibria and mixed Nash equilibria. Also, Nash Equilibria are purely theoretical and depend on players acting optimally and being rational, this is not always the case with humans and we can act very irrationally. Therefore empirically we will see that games can have very unexpected outcomes and you may be able to get a better payoff if you move away from a strictly theoretical strategy and take advantage of your opponent's irrational behavior.

The insufficiency with current AI systems is that they only try to achieve Nash equilibriums instead of trying to actively detect and exploit weaknesses in opponents. At the Nash equilibrium, there is no incentive for any player to change their initial strategy, so it is a stable state of the system. For example, let's consider the game of Rock-Paper-Scissors, the Nash equilibrium is to randomly pick any option with equal probability. However, we can see that this means the best strategy that the opponent can have will result in a tie. Therefore, in this example, our player cannot win in expectation. Now let's try to combine the Nash equilibrium strategy and opponent exploitation. We can initially use the Nash equilibrium strategy and then change our strategy overtime to exploit the observed weaknesses of our opponent. For example, we switch to always play Rock against our opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy opens up the possibility for our opponent to use our strategy against ourselves. For example, they notice we always play Rock and thus they will now always play Paper.

Trying to approximate a Nash equilibrium is hard in theory, and in games with more than two players, it can only find a handful of possible strategies per player. Currently, existing techniques to find ways to exploit an opponent require way too many samples and are not competitive enough outside of small games. Finding a Nash equilibrium in three or more players is a great challenge. Even we can efficiently compute a Nash equilibrium in games with more than two players, it is still highly questionable if playing the Nash equilibrium strategy is a good choice. Additionally, if each player tries to find their own version of a Nash equilibrium, we could have infinitely many strategies and each player’s version of the equilibrium might not even be a Nash equilibrium.

Consider the Lemonade Stand example from Figure 1 Below. We have 4 players and the goal for each player is to find a spot in the ring that is furthest away from every other player. This way, each lemonade stand can cover as much selling region as possible and generate maximum revenue. In the left circle, we have three different Nash equilibria distinguished by different colors which would benefit everyone. The right circle is an illustration of what would happen if each player decides to calculate their own Nash equilibrium.

[[File:Lemonade_Example.png| 600px |center ]]

<div align="center">Figure 1: Lemonade Stand Example</div>

From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibria, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.

== Theoretical Analysis ==
Pluribus uses forms of abstraction to make computations scalable. To simplify the complexity due to too many decision points, some actions are eliminated from consideration and similar decision points are grouped together and treated as identical. This process is called abstraction. Pluribus uses two kinds of abstraction: Action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes (the exact number of bets it considers varies between 1 and 14 depending on the situation). Information abstraction groups together decision points that reveal similar information. For instance, the player’s cards and revealed board cards. This is only used to reason about situations on future betting rounds, never the current betting round.

Pluribus uses a built-in strategy - “Blueprint strategy”, which it gradually improves by searching in real-time in situations it finds itself in during the course of the game. In the first betting round, pluribus uses the initial blueprint strategy when the number of decision points is small. The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used in imperfect information games AI which is trained by repeatedly playing against copies of itself, without any data of human or prior AI play used as input. For ease of computation of CFR in this context, poker is represented as a game tree. A game tree is a tree structure where each node represents either a player’s decision, a chance event, or a terminal outcome and edges represent actions taken.

[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]

<div align="center">Figure 1: Kuhn Poker (Simpler form of Poker) </div>

At the start of each iteration, MCCFR stimulates a hand of poker randomly (Cards held by a player at a given time) and designates one player as the traverser of the game tree. Once that is completed, the AI reviews the decision made by the traverser at a decision point in the game and investigates whether the decision was profitable. The AI compares its decision with other actions available to the traverser at that point and also with the future hypothetical decisions that would have been made following the other available actions. To evaluate a decision, the Counterfactual Regret factor is used. This is the difference between what the traverser would have expected to receive for choosing an action and actually received on the iteration. Thus regret is a numeric value, where a positive regret indicates you regret your decision, a negative regret indicates you are happy with your decision and zero regret indicates that you are indifferent.

The value of counterfactual regret for a decision is adjusted over the iterations as more scenarios or decision points are encountered. This means at the end of each iteration, the traverser’s strategy is updated so actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations until the average strategy overall iterations converge and the average strategy is the approximated Nash equilibrium. CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in early iterations to reduce the influence of initial bad iterations i.e it assigns a weight of T to regret contributions at iteration T. This causes the influence of the first iteration to decay at a rate of <math>\frac{1}{\sum_{t=1}^Tt} = \frac{2}{T(T+1)}</math>, compared to a rate of <math>\frac{1}{T}</math> in the original CFR algorithm. This leads to the strategy of improving more quickly in practice.

An additional feature of Pluribus is that in the subgames, instead of assuming that all players play according to a single strategy, Pluribus considers that each player may choose between k different strategies specialized to each player when a decision point is reached. This results in the searcher choosing a more balanced strategy. For instance, if a player never bluffs while holding the best possible hand then the opponents would learn that fact and always fold in that scenario. To fold in that scenario is a balanced strategy than to bet.
Therefore, the blueprint strategy is produced offline for the entire game and it is gradually improved while making real-time decisions during the game.

== Experimental Results ==
To test how well Pluribus functions, it was tested against human players in 2 formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who have won more than $1M playing professionally and were provided with cash incentives to play their best. 10,000 hands of poker were played over 12 days with the 5H+1AI format by anonymizing the players by providing each of them with aliases that remained consistent throughout all their games. The aliases helped the players keep track of the tendencies and types of games played by each player over the 10,000 hands played.

The second format included one human player and 5 copies of Pluribus (1H+5AI). There were 2 more professional players who split another 10,000 hands of poker by playing 5000 hands each and followed the same aliasing process as the first format.
The performance was measured using milli big blinds per game, mbb/game, (i.e. the initial amount of money the second player has to put in the pot) which is the standard measure in the AI field. Additionally, AIVAT was used as the variance reduction technique to control for luck in the games, and significance tests were run at a 95% significance level with one-tailed t-tests as a check for Pluribus’s performance in being profitable.

Applying AIVAT the following were the results:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable
|-
! scope="row" | 5H+1AI
| 48 || 25 || 0.028
|-
! scope="row" | 1H+5AI
| 32 || 15 || 0.014
|}
[[File:top.PNG| 950px | x450px |left]]

<div align="center">"Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot."</div>

Optimal play in Pluribus looks different from well-known poker conventions: A standard convention of “limping” in poker (calling the 'big blind' rather than folding or raising) is confirmed to be not optimal by Pluribus since it initially experimented with it but eliminated this from its strategy over its games of self-play. On the other hand, another convention of “donk betting” (starting a round by betting when someone else ended the previous round with a call) that is dismissed by players was adopted by Pluribus much more often than played by humans and is proven to be profitable.

== Discussion and Critiques ==

Pluribus' Blueprint strategy and Abstraction methods effectively reduce the computational power required. Hence it was computed in 8 days and required less than 512 GB of memory, and costs about $144 to produce. This is in sharp contrast to all the other recent superhuman AI milestones for games. This is a great way the researchers have condensed down the problem to fit the current computational powers.

Pluribus definitely shows that we can capture observational data and empirical results to construct a superhuman AI without requiring theoretical guarantees, this can be a baseline for future AI inventions and help in the research of AI. It would be interesting to use Pluribus's way of using a non-theoretical approach in more real-life problems such as autonomous driving or stock market trading.

Extending this idea beyond two-player zero-sum games will have many applications in real life.

The summary for Superhuman AI for Multiplayer Poker is very well written, with a detailed explanation of the concept, steps, and result and with a combination of visual images. However, it seems that the experiment of the study is not well designed. For example, sample selection is not strict and well defined, this could cause selection bias introduced into the result and thus making it not generalizable.

Superhuman AI, while sounding superior, is actually not uncommon. There have been many endeavours on mastering poker such as the Recursive Belief-based Learning (ReBeL) by Facebook Research. They pursued a method of reinforcement learning on a partially observable Markov decision process which was inspired by the recent successes of AlphaZero. For Pluribus to demonstrate how effective it is compared to the state-of-the-art, it should run some experiments against ReBeL.

This is a very interesting topic, and this summary is clear enough for readers to understand. I think this application not only can apply in poker, maybe thinking of more applications in other areas? There are many famous AI that really changing our life. For example, AlphaGo and AlphaStar, which are developed by Google DeepMind, defeated professional gamers. Discussing more this will be interesting.

One of the biggest issues when applying AI to games against humans (when not all information is known, ie, opponents' cards) is the assumption is generally made that the human players are rational players which follow a certain set of "rules" based on the information that they know. This could be an issue with the fact that Pluribus has trained itself by playing itself instead of humans. While the results clearly show that Pluribus has found some kind of 'optimal' method to play, it would be interesting to see if it could actually maximize its profits by learning the trends of its human opponents over time (learning on the fly with information gained each hand while it's playing). In addition to that, the paper may discuss how human action could be changed in the game when they play with Superhuman AI. We can see that playing card games require various strategy and different people can have a different set of actions in the same game and in the same situation.

One interesting software called Piosolver leverages a similar tree-based algorithm presented in the paper to recommend the move that is deemed game theory optimal (GTO). In the poker world, GTO is a play-style that is based on mathematics and is considered a "defensive" strategy. Following the rock, paper, scissors analogy from the paper, a GTO play-style is synonymous with choosing randomly between the three options, whereas an exploitative strategy involves reading a human player's tendencies and adjusting the strategy accordingly. Piosolver is used by many professional poker players to enhance their game and gain intuition on what the best move is in certain situations.

Another way to train the proposed model can be a poker game with two or more AI players. That method was used by AlphaGo to train a better model.

Games with various AI players would be an interesting topic, and through comparing different AI, their shortcomes could be observed and improved. More discussions on this would be of interests.

Similar to Pluribis, recently, there was another

== Conclusion ==

As Pluribus’s strategy was not developed with any human data and was trained by self-play only, it is an unbiased and different perspective on how optimal play can be attained.
Developing a superhuman AI for multiplayer poker was a widely recognized
a milestone in this area and the major remaining milestone in computer poker.
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.

== References ==

Noam Brown and Tuomas Sandholm (July 11, 2019). Superhuman AI for multiplayer poker. Science 365.

Osborne, Martin J.; Rubinstein, Ariel (July 12, 1994). A Course in Game Theory. Cambridge, MA: MIT. p. 14.

Justin Sermeno. (November 17, 2020). Vanilla Counterfactual Regret Minimization for Engineers. https://justinsermeno.com/posts/cfr/#:~:text=Counterfactual%20regret%20minimization%20%28CFR%29%20is%20an%20algorithm%20that,decision.%20It%20can%20be%20positive%2C%20negative%2C%20or%20zero

Brown, N., Bakhtin, A., Lerer, A., & Gong, Q. (2020). Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33.

Music Recommender System Based using CRNN

2020-11-30T00:13:30Z

D287zhan: /* Critiques/ Insights: */

==Introduction and Objective:==

In the digital era of music streaming, companies, such as Spotify and Pandora, are faced with the following challenge: can they provide users with relevant and personalized music recommendations amidst the ever-growing abundance of music and user data.

The objective of this paper is to implement a personalized music recommender system that takes user listening history as input and continually finds new music that captures individual user preferences.

This paper argues that a music recommendation system should vary from the general recommendation system used in practice since it should combine music feature recognition and audio processing technologies to extract music features, and combine them with data on user preferences.

The authors of this paper took a content-based music approach to build the recommendation system - specifically, comparing the similarity of features based on the audio signal.

The following two-method approach to building the recommendation system was followed:
#Make recommendations including genre information extracted from classification algorithms.
#Make recommendations without genre information.

The authors used convolutional recurrent neural networks (CRNN), which is a combination of convolutional neural networks (CNN) and recurrent neural network(RNN), as their main classification model.

==Methods and Techniques:==
Generally, a music recommender can be divided into three main parts: (I) users, (ii) items, and (iii) user-item matching algorithms. First, we generated users' music tastes based on their profiles. Second, item profiling includes editorial, cultural, and acoustic metadata were collected for listeners' satisfaction. Finally, we come to the matching algorithm that suggests recommended personalized music to listeners.

To classify music, the original music’s audio signal is converted into a spectrogram image. Using the image and the Short Time Fourier Transform (STFT), we convert the data into the Mel scale which is used in the CNN and CRNN models.
=== Mel Scale: ===
The scale of pitches that are heard by listeners, which translates to equal pitch increments.

[[File:Mel.png|frame|none|Mel Scale on Spectrogram]]

=== Short Time Fourier Transform (STFT): ===
The transformation that determines the sinusoidal frequency of the audio, with a Hanning smoothing function. In the continuous case this is written as: <math>\mathbf{STFT}\{x(t)\}(\tau,\omega) \equiv X(\tau, \omega) = \int_{-\infty}^{\infty} x(t) w(t-\tau) e^{-i \omega t} \, d t </math>

where: <math>w(\tau)</math> is the Hanning smoothing function

=== Convolutional Neural Network (CNN): ===
Neural Network that uses convolution in place of matrix multiplication for some layer calculations. By training the data, weights for inputs are updated to find the most significant data relevant to classification. These convolutional layers gather small groups of data and with kernels, and try to find patterns that can help find features in the overall data. The features are then used for classification. Padding is also used to maintain the data on the edges.

[[File:Convolution.png|thumb|400px|left|Convolution Operation]]
[[File:PaddingKernels.png|thumb|400px|center|Example of Padding (white 0s) and Kernels (blue square)]]

=== Convolutional Recurrent Neural Network (CRNN): ===
Similar Neural Network as CNN, with the addition of a GRU, which is a Recurrent Neural Network (RNN). An RNN is used to treat sequential data, by reusing the activation function of previous nodes to update the output. A Gated Recurrent Unit (GRU) is used to store more long-term memory and will help train the early hidden layers.

[[File:GRU441.png|thumb|400px|left|Gated Recurrent Unit (GRU)]]
[[File:Recurrent441.png|thumb|400px|center|Diagram of General Recurrent Neural Network]]

==Data Screening:==

The authors of this paper used a publicly available music dataset made up of 25,000 30-second songs from the Free Music Archives. To ensure a balanced dataset, only 1000 songs each from the genres of classical, electronic, folk, hip-hop, instrumental, jazz and rock were used in the final model.

[[File:Data441.png|thumb|200px|none|Data sorted by music genre]]

==Implementation:==

=== Modeling Neural Networks ===

As noted previously, both CNNs and CRNNs were used to model the data. The advantage of CRNNs is that they are able to model time sequence patterns in addition to frequency features from the spectrogram, allowing for greater identification of important features. Furthermore, feature vectors produced before the classification stage could be used to improve accuracy.

In implementing the neural networks, the Mel-spectrogram data was split up into training, validation, and test sets at a ratio of 8:1:1 respectively and labelled via one-hot encoding. This made it possible for the categorical data to be labelled correctly for binary classification. As opposed to classical stochastic gradient descent, the authors opted to use Adam optimization to update weights in the training phase. Binary cross-entropy was used as the loss function.

In both the CNN and CRNN models, the data was trained over 100 epochs with a binary cross-entropy loss function. The sigmoid function was used as the output layer.

An overview of the CNN and CRNN architecture can be found in the charts below.

[[File:CNN441.png|thumb|800px|none|Implementation of CNN Model]]
[[File:CRNN441.png|thumb|800px|none|Implementation of CRNN Model]]

=== Music Recommendation System ===

The recommendation system is computed by the cosine similarity of the extraction features from the neural network. Each genre will have a song act as a centre point for each class. The final inputs of the trained neural networks will be the feature variables. The featured variables will be used in the cosine similarity to find the best recommendations.

The values are between [-1,1], where larger values are songs that have similar features. When the user inputs five songs, those songs become the new inputs in the neural networks and the features are used by the cosine similarity with other music. The largest five cosine similarities are used as recommendations.
[[File:Cosine441.png|frame|100px|none|Cosine Similarity]]

== Evaluation Metrics ==
=== Precision: ===
* The proportion of True Positives with respect to the '''predicted''' positive cases (true positives and false positives)
* For example, out of all the songs that the classifier '''predicted''' as Classical, how many are actually Classical?
* Describes the rate at which the classifier predicts the true genre of songs among those predicted to be of that certain genre

=== Recall: ===
* The proportion of True Positives with respect to the '''actual''' positive cases (true positives and false negatives)
* For example, out of all the songs that are '''actually''' Classical, how many are correctly predicted to be Classical?
* Describes the rate at which the classifier predicts the true genre of songs among the correct instances of that genre

=== F1-Score: ===
An accuracy metric that combines the classifier’s precision and recall scores by taking the harmonic mean between the two metrics:

[[File:F1441.png|frame|100px|none|F1-Score]]

=== Receiver operating characteristics (ROC): ===
* A graphical metric that is used to assess a classification model at different classification thresholds
* In the case of a classification threshold of 0.5, this means that if <math>P(Y = k | X = x) > 0.5</math> then we classify this instance as class k
* Plots the true positive rate versus false positive rate as the classification threshold is varied

[[File:ROCGraph.jpg|thumb|400px|none|ROC Graph. Comparison of True Positive Rate and False Positive Rate]]

=== Area Under the Curve (AUC) ===
AUC is the area under the ROC in doing so, the ROC provides an aggregate measure across all possible classification thresholds.

In the context of the paper: When scoring all songs as <math>Prob(Classical | X=x)</math>, it is the probability that the model ranks a random Classical song at a higher probability than a random non-Classical song.

[[File:AUCGraph.jpg|thumb|400px|none|Area under the ROC curve.]]

== Results ==
=== Accuracy Metrics ===
The table below is the accuracy metrics with the classification threshold of 0.5.

[[File:TruePositiveChart.jpg|thumb|none|True Positive / False Positive Chart]]
On average, CRNN outperforms CNN in true positive and false positive cases.

[[File:F1Chart441.jpg|thumb|400px|none|F1 Chart]]
On average, CRNN outperforms CNN in F1-score.

[[File:AUCChart.jpg|thumb|400px|none|AUC Chart]]
On average, CRNN also outperforms CNN in AUC metric.

CRNN models that consider the frequency features and time sequence patterns of songs have a better classification performance through metrics such as F1 score and AUC when comparing to CNN classifier.

=== Evaluation of Music Recommendation System: ===

* A listening experiment was performed with 30 participants to access user responses to given music recommendations.
* Participants choose 5 preferred music and the recommender system gives 5 recommendations; the participants evaluated the music recommendation by recording whether the song was liked or disliked.
* The recommendation system takes two approaches to the recommendation:
** Method one uses only the value of cosine similarity.
** Method two uses the value of cosine similarity and information on music genre.
*Perform test of significance of differences in respondents to the two methods using t-statistic:
[[File:H0441.png|frame|100px|none|Hypothesis test between method 1 and method 2]]

Comparing the two methods, <math> H_0: u_1 - u_2 = 0</math>, we have <math> t_{stat} = -4.743 < -2.037 </math> which concludes that the addition of a music genre information increases -- it is statistically significant.

== Conclusion: ==

Here are two main conclusions obtained from this paper:

- To increase the predictive capabilities of the music recommendation system, the music genre should be a key feature to analyze.

- To extract the song genre from a song’s audio signals and get overall better performance, CRNN’s are superior to CNN’s as they consider frequency in features and time sequence patterns of audio signals.

According to analyses in the paper, the authors also suggested adding other music features like tempo gram for capturing local tempo to improve the accuracy of the recommender system.

== Critiques/ Insights: ==
# The authors fail to give reference to the performance of current recommendation algorithms used in the industry; my critique would be for the authors to bench-mark their novel approach with other recommendation algorithms such as collaborative filtering to see if there is a lift in predictive capabilities.
# The listening experiment used to evaluate the recommendation system only includes songs that are outputted by the model. Users may be biased if they believe all songs have come from a recommendation system. To remove bias, we suggest having 15 songs where 5 songs are recommended and 10 songs are set. With this in the user’s mind, it may remove some bias in response and give more accurate predictive capabilities.
# They could go into more details about how CRNN makes it perform better than CNN, in terms of attributes of each network.
# The methodology introduced in this paper is probably also suitable for movie recommendations. As music is presented as spectrograms (images) in a time sequence, and it is very similar to a movie.
# The way of evaluation is a very interesting approach. Since it's usually not easy to evaluate the testing result when it's subjective. By listing all these evaluations' performance, the result would be more comprehensive.
# The paper lacks the comparison between the proposed algorithm and the music recommendation algorithms being used now. It will be clearer to show the superiority of this algorithm.
# The GAN neural network has been proposed to enhance the performance of the neural network, so an improved result may appear after considering using GAN.
# The limitation of CNN and CRNN could be that they are only able to process the spectrograms with single labels rather than multiple labels. This is far from enough for the music recommender systems in today's music industry since the edges between various genres are blurred.
# according to the author, the recommender system is done by calculating the cosine similarity of extraction features from one music to another music. Is possible to represent it by Euclidean distance or p-norm distances?
# In real-life application, most of the music software will have the ability to recommend music to the listener and ask do they like the music that was recommended. It would be a nice application by involving some new information from the listener.
# This paper is very similar to another [https://link.springer.com/chapter/10.1007/978-3-319-46131-1_29 paper], written by Bruce Fewerda and Markus Schedl. Both papers are suggesting methods of building music recommendation systems. However, this paper recommends music based on genre, but the paper written by Fewerda and Schedl suggests a personality-based user modeling for music recommender systems.

Music Recommender System Based using CRNN

2020-11-30T00:12:58Z

D287zhan: /* Critiques/ Insights: */

==Introduction and Objective:==

In the digital era of music streaming, companies, such as Spotify and Pandora, are faced with the following challenge: can they provide users with relevant and personalized music recommendations amidst the ever-growing abundance of music and user data.

The objective of this paper is to implement a personalized music recommender system that takes user listening history as input and continually finds new music that captures individual user preferences.

This paper argues that a music recommendation system should vary from the general recommendation system used in practice since it should combine music feature recognition and audio processing technologies to extract music features, and combine them with data on user preferences.

The authors of this paper took a content-based music approach to build the recommendation system - specifically, comparing the similarity of features based on the audio signal.

The following two-method approach to building the recommendation system was followed:
#Make recommendations including genre information extracted from classification algorithms.
#Make recommendations without genre information.

The authors used convolutional recurrent neural networks (CRNN), which is a combination of convolutional neural networks (CNN) and recurrent neural network(RNN), as their main classification model.

==Methods and Techniques:==
Generally, a music recommender can be divided into three main parts: (I) users, (ii) items, and (iii) user-item matching algorithms. First, we generated users' music tastes based on their profiles. Second, item profiling includes editorial, cultural, and acoustic metadata were collected for listeners' satisfaction. Finally, we come to the matching algorithm that suggests recommended personalized music to listeners.

To classify music, the original music’s audio signal is converted into a spectrogram image. Using the image and the Short Time Fourier Transform (STFT), we convert the data into the Mel scale which is used in the CNN and CRNN models.
=== Mel Scale: ===
The scale of pitches that are heard by listeners, which translates to equal pitch increments.

[[File:Mel.png|frame|none|Mel Scale on Spectrogram]]

=== Short Time Fourier Transform (STFT): ===
The transformation that determines the sinusoidal frequency of the audio, with a Hanning smoothing function. In the continuous case this is written as: <math>\mathbf{STFT}\{x(t)\}(\tau,\omega) \equiv X(\tau, \omega) = \int_{-\infty}^{\infty} x(t) w(t-\tau) e^{-i \omega t} \, d t </math>

where: <math>w(\tau)</math> is the Hanning smoothing function

=== Convolutional Neural Network (CNN): ===
Neural Network that uses convolution in place of matrix multiplication for some layer calculations. By training the data, weights for inputs are updated to find the most significant data relevant to classification. These convolutional layers gather small groups of data and with kernels, and try to find patterns that can help find features in the overall data. The features are then used for classification. Padding is also used to maintain the data on the edges.

[[File:Convolution.png|thumb|400px|left|Convolution Operation]]
[[File:PaddingKernels.png|thumb|400px|center|Example of Padding (white 0s) and Kernels (blue square)]]

=== Convolutional Recurrent Neural Network (CRNN): ===
Similar Neural Network as CNN, with the addition of a GRU, which is a Recurrent Neural Network (RNN). An RNN is used to treat sequential data, by reusing the activation function of previous nodes to update the output. A Gated Recurrent Unit (GRU) is used to store more long-term memory and will help train the early hidden layers.

[[File:GRU441.png|thumb|400px|left|Gated Recurrent Unit (GRU)]]
[[File:Recurrent441.png|thumb|400px|center|Diagram of General Recurrent Neural Network]]

==Data Screening:==

The authors of this paper used a publicly available music dataset made up of 25,000 30-second songs from the Free Music Archives. To ensure a balanced dataset, only 1000 songs each from the genres of classical, electronic, folk, hip-hop, instrumental, jazz and rock were used in the final model.

[[File:Data441.png|thumb|200px|none|Data sorted by music genre]]

==Implementation:==

=== Modeling Neural Networks ===

As noted previously, both CNNs and CRNNs were used to model the data. The advantage of CRNNs is that they are able to model time sequence patterns in addition to frequency features from the spectrogram, allowing for greater identification of important features. Furthermore, feature vectors produced before the classification stage could be used to improve accuracy.

In implementing the neural networks, the Mel-spectrogram data was split up into training, validation, and test sets at a ratio of 8:1:1 respectively and labelled via one-hot encoding. This made it possible for the categorical data to be labelled correctly for binary classification. As opposed to classical stochastic gradient descent, the authors opted to use Adam optimization to update weights in the training phase. Binary cross-entropy was used as the loss function.

In both the CNN and CRNN models, the data was trained over 100 epochs with a binary cross-entropy loss function. The sigmoid function was used as the output layer.

An overview of the CNN and CRNN architecture can be found in the charts below.

[[File:CNN441.png|thumb|800px|none|Implementation of CNN Model]]
[[File:CRNN441.png|thumb|800px|none|Implementation of CRNN Model]]

=== Music Recommendation System ===

The recommendation system is computed by the cosine similarity of the extraction features from the neural network. Each genre will have a song act as a centre point for each class. The final inputs of the trained neural networks will be the feature variables. The featured variables will be used in the cosine similarity to find the best recommendations.

The values are between [-1,1], where larger values are songs that have similar features. When the user inputs five songs, those songs become the new inputs in the neural networks and the features are used by the cosine similarity with other music. The largest five cosine similarities are used as recommendations.
[[File:Cosine441.png|frame|100px|none|Cosine Similarity]]

== Evaluation Metrics ==
=== Precision: ===
* The proportion of True Positives with respect to the '''predicted''' positive cases (true positives and false positives)
* For example, out of all the songs that the classifier '''predicted''' as Classical, how many are actually Classical?
* Describes the rate at which the classifier predicts the true genre of songs among those predicted to be of that certain genre

=== Recall: ===
* The proportion of True Positives with respect to the '''actual''' positive cases (true positives and false negatives)
* For example, out of all the songs that are '''actually''' Classical, how many are correctly predicted to be Classical?
* Describes the rate at which the classifier predicts the true genre of songs among the correct instances of that genre

=== F1-Score: ===
An accuracy metric that combines the classifier’s precision and recall scores by taking the harmonic mean between the two metrics:

[[File:F1441.png|frame|100px|none|F1-Score]]

=== Receiver operating characteristics (ROC): ===
* A graphical metric that is used to assess a classification model at different classification thresholds
* In the case of a classification threshold of 0.5, this means that if <math>P(Y = k | X = x) > 0.5</math> then we classify this instance as class k
* Plots the true positive rate versus false positive rate as the classification threshold is varied

[[File:ROCGraph.jpg|thumb|400px|none|ROC Graph. Comparison of True Positive Rate and False Positive Rate]]

=== Area Under the Curve (AUC) ===
AUC is the area under the ROC in doing so, the ROC provides an aggregate measure across all possible classification thresholds.

In the context of the paper: When scoring all songs as <math>Prob(Classical | X=x)</math>, it is the probability that the model ranks a random Classical song at a higher probability than a random non-Classical song.

[[File:AUCGraph.jpg|thumb|400px|none|Area under the ROC curve.]]

== Results ==
=== Accuracy Metrics ===
The table below is the accuracy metrics with the classification threshold of 0.5.

[[File:TruePositiveChart.jpg|thumb|none|True Positive / False Positive Chart]]
On average, CRNN outperforms CNN in true positive and false positive cases.

[[File:F1Chart441.jpg|thumb|400px|none|F1 Chart]]
On average, CRNN outperforms CNN in F1-score.

[[File:AUCChart.jpg|thumb|400px|none|AUC Chart]]
On average, CRNN also outperforms CNN in AUC metric.

CRNN models that consider the frequency features and time sequence patterns of songs have a better classification performance through metrics such as F1 score and AUC when comparing to CNN classifier.

=== Evaluation of Music Recommendation System: ===

* A listening experiment was performed with 30 participants to access user responses to given music recommendations.
* Participants choose 5 preferred music and the recommender system gives 5 recommendations; the participants evaluated the music recommendation by recording whether the song was liked or disliked.
* The recommendation system takes two approaches to the recommendation:
** Method one uses only the value of cosine similarity.
** Method two uses the value of cosine similarity and information on music genre.
*Perform test of significance of differences in respondents to the two methods using t-statistic:
[[File:H0441.png|frame|100px|none|Hypothesis test between method 1 and method 2]]

Comparing the two methods, <math> H_0: u_1 - u_2 = 0</math>, we have <math> t_{stat} = -4.743 < -2.037 </math> which concludes that the addition of a music genre information increases -- it is statistically significant.

== Conclusion: ==

Here are two main conclusions obtained from this paper:

- To increase the predictive capabilities of the music recommendation system, the music genre should be a key feature to analyze.

- To extract the song genre from a song’s audio signals and get overall better performance, CRNN’s are superior to CNN’s as they consider frequency in features and time sequence patterns of audio signals.

According to analyses in the paper, the authors also suggested adding other music features like tempo gram for capturing local tempo to improve the accuracy of the recommender system.

== Critiques/ Insights: ==
# The authors fail to give reference to the performance of current recommendation algorithms used in the industry; my critique would be for the authors to bench-mark their novel approach with other recommendation algorithms such as collaborative filtering to see if there is a lift in predictive capabilities.
# The listening experiment used to evaluate the recommendation system only includes songs that are outputted by the model. Users may be biased if they believe all songs have come from a recommendation system. To remove bias, we suggest having 15 songs where 5 songs are recommended and 10 songs are set. With this in the user’s mind, it may remove some bias in response and give more accurate predictive capabilities.
# They could go into more details about how CRNN makes it perform better than CNN, in terms of attributes of each network.
# The methodology introduced in this paper is probably also suitable for movie recommendations. As music is presented as spectrograms (images) in a time sequence, and it is very similar to a movie.
# The way of evaluation is a very interesting approach. Since it's usually not easy to evaluate the testing result when it's subjective. By listing all these evaluations' performance, the result would be more comprehensive.
# The paper lacks the comparison between the proposed algorithm and the music recommendation algorithms being used now. It will be clearer to show the superiority of this algorithm.
# The GAN neural network has been proposed to enhance the performance of the neural network, so an improved result may appear after considering using GAN.
# The limitation of CNN and CRNN could be that they are only able to process the spectrograms with single labels rather than multiple labels. This is far from enough for the music recommender systems in today's music industry since the edges between various genres are blurred.
# according to the author, the recommender system is done by calculating the cosine similarity of extraction features from one music to another music. Is possible to represent it by Euclidean distance or p-norm distances?
# In real-life application, most of the music software will have the ability to recommend music to the listener and ask do they like the music that was recommended. It would be a nice application by involving some new information from the listener.
# This paper is very similar to another paper, [https://link.springer.com/chapter/10.1007/978-3-319-46131-1_29 Paper], written by Bruce Fewerda and Markus Schedl. Both papers are suggesting methods of building music recommendation systems. However, this paper recommends music based on genre, but the paper written by Fewerda and Schedl suggests a personality-based user modeling for music recommender systems.

Music Recommender System Based using CRNN

2020-11-30T00:11:45Z

D287zhan: /* Critiques/ Insights: */

==Introduction and Objective:==

In the digital era of music streaming, companies, such as Spotify and Pandora, are faced with the following challenge: can they provide users with relevant and personalized music recommendations amidst the ever-growing abundance of music and user data.

The objective of this paper is to implement a personalized music recommender system that takes user listening history as input and continually finds new music that captures individual user preferences.

This paper argues that a music recommendation system should vary from the general recommendation system used in practice since it should combine music feature recognition and audio processing technologies to extract music features, and combine them with data on user preferences.

The authors of this paper took a content-based music approach to build the recommendation system - specifically, comparing the similarity of features based on the audio signal.

The following two-method approach to building the recommendation system was followed:
#Make recommendations including genre information extracted from classification algorithms.
#Make recommendations without genre information.

The authors used convolutional recurrent neural networks (CRNN), which is a combination of convolutional neural networks (CNN) and recurrent neural network(RNN), as their main classification model.

==Methods and Techniques:==
Generally, a music recommender can be divided into three main parts: (I) users, (ii) items, and (iii) user-item matching algorithms. First, we generated users' music tastes based on their profiles. Second, item profiling includes editorial, cultural, and acoustic metadata were collected for listeners' satisfaction. Finally, we come to the matching algorithm that suggests recommended personalized music to listeners.

To classify music, the original music’s audio signal is converted into a spectrogram image. Using the image and the Short Time Fourier Transform (STFT), we convert the data into the Mel scale which is used in the CNN and CRNN models.
=== Mel Scale: ===
The scale of pitches that are heard by listeners, which translates to equal pitch increments.

[[File:Mel.png|frame|none|Mel Scale on Spectrogram]]

=== Short Time Fourier Transform (STFT): ===
The transformation that determines the sinusoidal frequency of the audio, with a Hanning smoothing function. In the continuous case this is written as: <math>\mathbf{STFT}\{x(t)\}(\tau,\omega) \equiv X(\tau, \omega) = \int_{-\infty}^{\infty} x(t) w(t-\tau) e^{-i \omega t} \, d t </math>

where: <math>w(\tau)</math> is the Hanning smoothing function

=== Convolutional Neural Network (CNN): ===
Neural Network that uses convolution in place of matrix multiplication for some layer calculations. By training the data, weights for inputs are updated to find the most significant data relevant to classification. These convolutional layers gather small groups of data and with kernels, and try to find patterns that can help find features in the overall data. The features are then used for classification. Padding is also used to maintain the data on the edges.

[[File:Convolution.png|thumb|400px|left|Convolution Operation]]
[[File:PaddingKernels.png|thumb|400px|center|Example of Padding (white 0s) and Kernels (blue square)]]

=== Convolutional Recurrent Neural Network (CRNN): ===
Similar Neural Network as CNN, with the addition of a GRU, which is a Recurrent Neural Network (RNN). An RNN is used to treat sequential data, by reusing the activation function of previous nodes to update the output. A Gated Recurrent Unit (GRU) is used to store more long-term memory and will help train the early hidden layers.

[[File:GRU441.png|thumb|400px|left|Gated Recurrent Unit (GRU)]]
[[File:Recurrent441.png|thumb|400px|center|Diagram of General Recurrent Neural Network]]

==Data Screening:==

The authors of this paper used a publicly available music dataset made up of 25,000 30-second songs from the Free Music Archives. To ensure a balanced dataset, only 1000 songs each from the genres of classical, electronic, folk, hip-hop, instrumental, jazz and rock were used in the final model.

[[File:Data441.png|thumb|200px|none|Data sorted by music genre]]

==Implementation:==

=== Modeling Neural Networks ===

As noted previously, both CNNs and CRNNs were used to model the data. The advantage of CRNNs is that they are able to model time sequence patterns in addition to frequency features from the spectrogram, allowing for greater identification of important features. Furthermore, feature vectors produced before the classification stage could be used to improve accuracy.

In implementing the neural networks, the Mel-spectrogram data was split up into training, validation, and test sets at a ratio of 8:1:1 respectively and labelled via one-hot encoding. This made it possible for the categorical data to be labelled correctly for binary classification. As opposed to classical stochastic gradient descent, the authors opted to use Adam optimization to update weights in the training phase. Binary cross-entropy was used as the loss function.

In both the CNN and CRNN models, the data was trained over 100 epochs with a binary cross-entropy loss function. The sigmoid function was used as the output layer.

An overview of the CNN and CRNN architecture can be found in the charts below.

[[File:CNN441.png|thumb|800px|none|Implementation of CNN Model]]
[[File:CRNN441.png|thumb|800px|none|Implementation of CRNN Model]]

=== Music Recommendation System ===

The recommendation system is computed by the cosine similarity of the extraction features from the neural network. Each genre will have a song act as a centre point for each class. The final inputs of the trained neural networks will be the feature variables. The featured variables will be used in the cosine similarity to find the best recommendations.

The values are between [-1,1], where larger values are songs that have similar features. When the user inputs five songs, those songs become the new inputs in the neural networks and the features are used by the cosine similarity with other music. The largest five cosine similarities are used as recommendations.
[[File:Cosine441.png|frame|100px|none|Cosine Similarity]]

== Evaluation Metrics ==
=== Precision: ===
* The proportion of True Positives with respect to the '''predicted''' positive cases (true positives and false positives)
* For example, out of all the songs that the classifier '''predicted''' as Classical, how many are actually Classical?
* Describes the rate at which the classifier predicts the true genre of songs among those predicted to be of that certain genre

=== Recall: ===
* The proportion of True Positives with respect to the '''actual''' positive cases (true positives and false negatives)
* For example, out of all the songs that are '''actually''' Classical, how many are correctly predicted to be Classical?
* Describes the rate at which the classifier predicts the true genre of songs among the correct instances of that genre

=== F1-Score: ===
An accuracy metric that combines the classifier’s precision and recall scores by taking the harmonic mean between the two metrics:

[[File:F1441.png|frame|100px|none|F1-Score]]

=== Receiver operating characteristics (ROC): ===
* A graphical metric that is used to assess a classification model at different classification thresholds
* In the case of a classification threshold of 0.5, this means that if <math>P(Y = k | X = x) > 0.5</math> then we classify this instance as class k
* Plots the true positive rate versus false positive rate as the classification threshold is varied

[[File:ROCGraph.jpg|thumb|400px|none|ROC Graph. Comparison of True Positive Rate and False Positive Rate]]

=== Area Under the Curve (AUC) ===
AUC is the area under the ROC in doing so, the ROC provides an aggregate measure across all possible classification thresholds.

In the context of the paper: When scoring all songs as <math>Prob(Classical | X=x)</math>, it is the probability that the model ranks a random Classical song at a higher probability than a random non-Classical song.

[[File:AUCGraph.jpg|thumb|400px|none|Area under the ROC curve.]]

== Results ==
=== Accuracy Metrics ===
The table below is the accuracy metrics with the classification threshold of 0.5.

[[File:TruePositiveChart.jpg|thumb|none|True Positive / False Positive Chart]]
On average, CRNN outperforms CNN in true positive and false positive cases.

[[File:F1Chart441.jpg|thumb|400px|none|F1 Chart]]
On average, CRNN outperforms CNN in F1-score.

[[File:AUCChart.jpg|thumb|400px|none|AUC Chart]]
On average, CRNN also outperforms CNN in AUC metric.

CRNN models that consider the frequency features and time sequence patterns of songs have a better classification performance through metrics such as F1 score and AUC when comparing to CNN classifier.

=== Evaluation of Music Recommendation System: ===

* A listening experiment was performed with 30 participants to access user responses to given music recommendations.
* Participants choose 5 preferred music and the recommender system gives 5 recommendations; the participants evaluated the music recommendation by recording whether the song was liked or disliked.
* The recommendation system takes two approaches to the recommendation:
** Method one uses only the value of cosine similarity.
** Method two uses the value of cosine similarity and information on music genre.
*Perform test of significance of differences in respondents to the two methods using t-statistic:
[[File:H0441.png|frame|100px|none|Hypothesis test between method 1 and method 2]]

Comparing the two methods, <math> H_0: u_1 - u_2 = 0</math>, we have <math> t_{stat} = -4.743 < -2.037 </math> which concludes that the addition of a music genre information increases -- it is statistically significant.

== Conclusion: ==

Here are two main conclusions obtained from this paper:

- To increase the predictive capabilities of the music recommendation system, the music genre should be a key feature to analyze.

- To extract the song genre from a song’s audio signals and get overall better performance, CRNN’s are superior to CNN’s as they consider frequency in features and time sequence patterns of audio signals.

According to analyses in the paper, the authors also suggested adding other music features like tempo gram for capturing local tempo to improve the accuracy of the recommender system.

== Critiques/ Insights: ==
# The authors fail to give reference to the performance of current recommendation algorithms used in the industry; my critique would be for the authors to bench-mark their novel approach with other recommendation algorithms such as collaborative filtering to see if there is a lift in predictive capabilities.
# The listening experiment used to evaluate the recommendation system only includes songs that are outputted by the model. Users may be biased if they believe all songs have come from a recommendation system. To remove bias, we suggest having 15 songs where 5 songs are recommended and 10 songs are set. With this in the user’s mind, it may remove some bias in response and give more accurate predictive capabilities.
# They could go into more details about how CRNN makes it perform better than CNN, in terms of attributes of each network.
# The methodology introduced in this paper is probably also suitable for movie recommendations. As music is presented as spectrograms (images) in a time sequence, and it is very similar to a movie.
# The way of evaluation is a very interesting approach. Since it's usually not easy to evaluate the testing result when it's subjective. By listing all these evaluations' performance, the result would be more comprehensive.
# The paper lacks the comparison between the proposed algorithm and the music recommendation algorithms being used now. It will be clearer to show the superiority of this algorithm.
# The GAN neural network has been proposed to enhance the performance of the neural network, so an improved result may appear after considering using GAN.
# The limitation of CNN and CRNN could be that they are only able to process the spectrograms with single labels rather than multiple labels. This is far from enough for the music recommender systems in today's music industry since the edges between various genres are blurred.
# according to the author, the recommender system is done by calculating the cosine similarity of extraction features from one music to another music. Is possible to represent it by Euclidean distance or p-norm distances?
# In real-life application, most of the music software will have the ability to recommend music to the listener and ask do they like the music that was recommended. It would be a nice application by involving some new information from the listener.
# This paper is very similar to another paper, (link), written by Bruce Fewerda and Markus Schedl. Both papers are suggesting methods of building music recommendation systems. However, this paper recommends music based on genre, but the paper written by Fewerda and Schedl suggests a personality-based user modeling for music recommender systems.

Music Recommender System Based using CRNN

2020-11-30T00:07:07Z

D287zhan: /* Critiques/ Insights: */

Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms

2020-11-29T23:47:22Z

D287zhan: /* Introduction */

== Presented by ==

Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee

== Introduction ==

This paper presents an approach to detecting heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. For context, ConvNetQuake is a convolutional neural network, used by Perol, Gharbi, and Denolle [4], for Earthquake detection and location from a single waveform. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to healthcare innovation, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.

== Previous Work and Motivation ==

The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo and Attarodi [2] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [3] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [1] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained.

The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.

== Model Architecture ==

The dataset, which was used to train, validate, and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.

This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.

During the training stage, a 10-second long two-channel input was fed into the neural network. In order to ensure that the two channels were weighted equally, both channels were normalized. Besides, time invariance was incorporated by selecting the 10-second long segment randomly from the entire signal.

The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.

This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.

During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.

[[File:architecture.png | thumb | center | 1000px | Model Architecture (Gupta et al., 2019)]]

==Result==

The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting in the highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of the top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.

Next, they discussed 2 factors affecting model performance evaluation: 1） Random train-val-test split might have effects on the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of the neural network shows little effects on the performance of the model performance evaluation, because of showing high average results with a fixed train-val-test split.

Comparing with other models in the other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records affecting the training accuracy, they used 290 fold patient-wise split, resulting in the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result in lower accuracy evaluation, however, it still maintains a high average of 97.83%.

==Conclusion & Discussion==

The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to a larger labeled data set is needed. The PTB database is small. It is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.

A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.

Fourier Transform(such as FFT) can be helpful when dealing with ECG signals. It transforms signals from time domain to frequency domain, which means some hidden features in frequency may be discovered.

==Critiques==
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.

- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.

- I feel that the results of deep learning models in medical settings where the consequences of misclassification can be severe should be evaluated by assigning weights to classification. In case if the misclassification can lead to severe consequences, then the network should be trained in such a way that it errs towards safety. For example, in case if heart disease, the consequences will be very high if the system says that there is no heart disease when in fact there is. So, the evaluation metric must be selected carefully.

- This is a useful and meaningful application topic in machine learning. Using Deep Learning to detect heart disease can be very helpful if it is difficult to detect disease by looking at ECG by humans eys. This model also useful for doing statistics, such as calculating the percentage of people get heart disease. But I think the doctor should not 100% trust the result from the model, it is almost impossible to get 100% accuracy from a model. So, I think double-checking by human eyes is necessary if the result is weird. What is more, I think it will be interesting to discuss more applications in mediccal by using this method, such as detecting the Brainwave diagram to predict a person's mood and to diagnose mental diseases.

- Compared to the dataset for other topics such as object recognition, the PTB database is pretty small with only 549 ECG records. And these are highly unbiased(Table 1) with 4 records for myocarditis and 148 for myocardial infarction. Medical datasets can only be labeled by specialists. This is why these datasets are related small. It would be great if there will be a larger, more comprehensive dataset.

== References ==

[1] Na Liu et al. "A Simple and Effective Method for Detecting Myocardial Infarction Based on Deep Convolutional Neural Network". In: Journal of Medical Imaging and Health Informatics (Sept. 2018). doi: 10.1166/jmihi.2018.2463.

[2] Naser Safdarian, N.J. Dabanloo, and Gholamreza Attarodi. "A New Pattern Recognition Method for Detection and Localization of Myocardial Infarction Using T-Wave Integral and Total Integral as Extracted Features from One Cycle of ECG Signal". In: J. Biomedical Science and Engineering (Aug. 2014). doi: http://dx.doi.org/10.4236/jbise.2014.710081.

[3] L.D. Sharma and R.K. Sunkaria. "Inferior myocardial infarction detection using stationary wavelet transform and machine learning approach." In: Signal, Image and Video Processing (July 2017). doi: https://doi.org/10.1007/s11760-017-1146-z.

[4] Perol Thibaut, Gharbi Michaël, and Denolle Marin. "Convolutional neural network for earthquake detection and location". In: Science Advances (Feb. 2018). doi: 10.1126/sciadv.1700578

Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms

2020-11-29T23:47:01Z

D287zhan: /* Introduction */

== Presented by ==

Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee

== Introduction ==

This paper presents an approach to detecting heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. For context, ConvNetQuake is a convolutional neural network, used by Perol, Gharbim, and Denolle [4], for Earthquake detection and location from a single waveform. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to healthcare innovation, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.

== Previous Work and Motivation ==

The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo and Attarodi [2] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [3] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [1] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained.

The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.

== Model Architecture ==

The dataset, which was used to train, validate, and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.

This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.

During the training stage, a 10-second long two-channel input was fed into the neural network. In order to ensure that the two channels were weighted equally, both channels were normalized. Besides, time invariance was incorporated by selecting the 10-second long segment randomly from the entire signal.

The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.

This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.

During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.

[[File:architecture.png | thumb | center | 1000px | Model Architecture (Gupta et al., 2019)]]

==Result==

The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting in the highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of the top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.

Next, they discussed 2 factors affecting model performance evaluation: 1） Random train-val-test split might have effects on the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of the neural network shows little effects on the performance of the model performance evaluation, because of showing high average results with a fixed train-val-test split.

Comparing with other models in the other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records affecting the training accuracy, they used 290 fold patient-wise split, resulting in the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result in lower accuracy evaluation, however, it still maintains a high average of 97.83%.

==Conclusion & Discussion==

The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to a larger labeled data set is needed. The PTB database is small. It is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.

A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.

Fourier Transform(such as FFT) can be helpful when dealing with ECG signals. It transforms signals from time domain to frequency domain, which means some hidden features in frequency may be discovered.

==Critiques==
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.

- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.

- I feel that the results of deep learning models in medical settings where the consequences of misclassification can be severe should be evaluated by assigning weights to classification. In case if the misclassification can lead to severe consequences, then the network should be trained in such a way that it errs towards safety. For example, in case if heart disease, the consequences will be very high if the system says that there is no heart disease when in fact there is. So, the evaluation metric must be selected carefully.

- This is a useful and meaningful application topic in machine learning. Using Deep Learning to detect heart disease can be very helpful if it is difficult to detect disease by looking at ECG by humans eys. This model also useful for doing statistics, such as calculating the percentage of people get heart disease. But I think the doctor should not 100% trust the result from the model, it is almost impossible to get 100% accuracy from a model. So, I think double-checking by human eyes is necessary if the result is weird. What is more, I think it will be interesting to discuss more applications in mediccal by using this method, such as detecting the Brainwave diagram to predict a person's mood and to diagnose mental diseases.

- Compared to the dataset for other topics such as object recognition, the PTB database is pretty small with only 549 ECG records. And these are highly unbiased(Table 1) with 4 records for myocarditis and 148 for myocardial infarction. Medical datasets can only be labeled by specialists. This is why these datasets are related small. It would be great if there will be a larger, more comprehensive dataset.

== References ==

[1] Na Liu et al. "A Simple and Effective Method for Detecting Myocardial Infarction Based on Deep Convolutional Neural Network". In: Journal of Medical Imaging and Health Informatics (Sept. 2018). doi: 10.1166/jmihi.2018.2463.

[2] Naser Safdarian, N.J. Dabanloo, and Gholamreza Attarodi. "A New Pattern Recognition Method for Detection and Localization of Myocardial Infarction Using T-Wave Integral and Total Integral as Extracted Features from One Cycle of ECG Signal". In: J. Biomedical Science and Engineering (Aug. 2014). doi: http://dx.doi.org/10.4236/jbise.2014.710081.

[3] L.D. Sharma and R.K. Sunkaria. "Inferior myocardial infarction detection using stationary wavelet transform and machine learning approach." In: Signal, Image and Video Processing (July 2017). doi: https://doi.org/10.1007/s11760-017-1146-z.

[4] Perol Thibaut, Gharbi Michaël, and Denolle Marin. "Convolutional neural network for earthquake detection and location". In: Science Advances (Feb. 2018). doi: 10.1126/sciadv.1700578

Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms

2020-11-29T23:45:03Z

D287zhan: /* Introduction */

== Presented by ==

Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee

== Introduction ==

This paper presents an approach to detecting heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. For context, ConvNetQuake is a convolutional neural network that has been used in the past for Earthquake detection and location from a single waveform [4]. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to healthcare innovation, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.

== Previous Work and Motivation ==

The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo and Attarodi [2] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [3] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [1] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained.

The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.

== Model Architecture ==

The dataset, which was used to train, validate, and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.

This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.

During the training stage, a 10-second long two-channel input was fed into the neural network. In order to ensure that the two channels were weighted equally, both channels were normalized. Besides, time invariance was incorporated by selecting the 10-second long segment randomly from the entire signal.

The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.

This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.

During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.

[[File:architecture.png | thumb | center | 1000px | Model Architecture (Gupta et al., 2019)]]

==Result==

The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting in the highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of the top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.

Next, they discussed 2 factors affecting model performance evaluation: 1） Random train-val-test split might have effects on the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of the neural network shows little effects on the performance of the model performance evaluation, because of showing high average results with a fixed train-val-test split.

Comparing with other models in the other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records affecting the training accuracy, they used 290 fold patient-wise split, resulting in the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result in lower accuracy evaluation, however, it still maintains a high average of 97.83%.

==Conclusion & Discussion==

The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to a larger labeled data set is needed. The PTB database is small. It is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.

A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.

Fourier Transform(such as FFT) can be helpful when dealing with ECG signals. It transforms signals from time domain to frequency domain, which means some hidden features in frequency may be discovered.

==Critiques==
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.

- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.

- I feel that the results of deep learning models in medical settings where the consequences of misclassification can be severe should be evaluated by assigning weights to classification. In case if the misclassification can lead to severe consequences, then the network should be trained in such a way that it errs towards safety. For example, in case if heart disease, the consequences will be very high if the system says that there is no heart disease when in fact there is. So, the evaluation metric must be selected carefully.

- This is a useful and meaningful application topic in machine learning. Using Deep Learning to detect heart disease can be very helpful if it is difficult to detect disease by looking at ECG by humans eys. This model also useful for doing statistics, such as calculating the percentage of people get heart disease. But I think the doctor should not 100% trust the result from the model, it is almost impossible to get 100% accuracy from a model. So, I think double-checking by human eyes is necessary if the result is weird. What is more, I think it will be interesting to discuss more applications in mediccal by using this method, such as detecting the Brainwave diagram to predict a person's mood and to diagnose mental diseases.

- Compared to the dataset for other topics such as object recognition, the PTB database is pretty small with only 549 ECG records. And these are highly unbiased(Table 1) with 4 records for myocarditis and 148 for myocardial infarction. Medical datasets can only be labeled by specialists. This is why these datasets are related small. It would be great if there will be a larger, more comprehensive dataset.

== References ==

[1] Na Liu et al. "A Simple and Effective Method for Detecting Myocardial Infarction Based on Deep Convolutional Neural Network". In: Journal of Medical Imaging and Health Informatics (Sept. 2018). doi: 10.1166/jmihi.2018.2463.

[2] Naser Safdarian, N.J. Dabanloo, and Gholamreza Attarodi. "A New Pattern Recognition Method for Detection and Localization of Myocardial Infarction Using T-Wave Integral and Total Integral as Extracted Features from One Cycle of ECG Signal". In: J. Biomedical Science and Engineering (Aug. 2014). doi: http://dx.doi.org/10.4236/jbise.2014.710081.

[3] L.D. Sharma and R.K. Sunkaria. "Inferior myocardial infarction detection using stationary wavelet transform and machine learning approach." In: Signal, Image and Video Processing (July 2017). doi: https://doi.org/10.1007/s11760-017-1146-z.

[4] Perol Thibaut, Gharbi Michaël, and Denolle Marin. "Convolutional neural network for earthquake detection and location". In: Science Advances (Feb. 2018). doi: 10.1126/sciadv.1700578

Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms

2020-11-29T23:44:32Z

D287zhan: /* References */

== Presented by ==

Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee

== Introduction ==

This paper presents an approach to detecting heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. For context, ConvNetQuake is a convolutional neural network that has been used in the past for Earthquake detection and location [4]. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to healthcare innovation, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.

== Previous Work and Motivation ==

The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo and Attarodi [2] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [3] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [1] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained.

The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.

== Model Architecture ==

The dataset, which was used to train, validate, and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.

This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.

During the training stage, a 10-second long two-channel input was fed into the neural network. In order to ensure that the two channels were weighted equally, both channels were normalized. Besides, time invariance was incorporated by selecting the 10-second long segment randomly from the entire signal.

The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.

This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.

During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.

[[File:architecture.png | thumb | center | 1000px | Model Architecture (Gupta et al., 2019)]]

==Result==

The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting in the highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of the top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.

Next, they discussed 2 factors affecting model performance evaluation: 1） Random train-val-test split might have effects on the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of the neural network shows little effects on the performance of the model performance evaluation, because of showing high average results with a fixed train-val-test split.

Comparing with other models in the other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records affecting the training accuracy, they used 290 fold patient-wise split, resulting in the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result in lower accuracy evaluation, however, it still maintains a high average of 97.83%.

==Conclusion & Discussion==

The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to a larger labeled data set is needed. The PTB database is small. It is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.

A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.

Fourier Transform(such as FFT) can be helpful when dealing with ECG signals. It transforms signals from time domain to frequency domain, which means some hidden features in frequency may be discovered.

==Critiques==
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.

- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.

- I feel that the results of deep learning models in medical settings where the consequences of misclassification can be severe should be evaluated by assigning weights to classification. In case if the misclassification can lead to severe consequences, then the network should be trained in such a way that it errs towards safety. For example, in case if heart disease, the consequences will be very high if the system says that there is no heart disease when in fact there is. So, the evaluation metric must be selected carefully.

- This is a useful and meaningful application topic in machine learning. Using Deep Learning to detect heart disease can be very helpful if it is difficult to detect disease by looking at ECG by humans eys. This model also useful for doing statistics, such as calculating the percentage of people get heart disease. But I think the doctor should not 100% trust the result from the model, it is almost impossible to get 100% accuracy from a model. So, I think double-checking by human eyes is necessary if the result is weird. What is more, I think it will be interesting to discuss more applications in mediccal by using this method, such as detecting the Brainwave diagram to predict a person's mood and to diagnose mental diseases.

- Compared to the dataset for other topics such as object recognition, the PTB database is pretty small with only 549 ECG records. And these are highly unbiased(Table 1) with 4 records for myocarditis and 148 for myocardial infarction. Medical datasets can only be labeled by specialists. This is why these datasets are related small. It would be great if there will be a larger, more comprehensive dataset.

== References ==

[1] Na Liu et al. "A Simple and Effective Method for Detecting Myocardial Infarction Based on Deep Convolutional Neural Network". In: Journal of Medical Imaging and Health Informatics (Sept. 2018). doi: 10.1166/jmihi.2018.2463.

[2] Naser Safdarian, N.J. Dabanloo, and Gholamreza Attarodi. "A New Pattern Recognition Method for Detection and Localization of Myocardial Infarction Using T-Wave Integral and Total Integral as Extracted Features from One Cycle of ECG Signal". In: J. Biomedical Science and Engineering (Aug. 2014). doi: http://dx.doi.org/10.4236/jbise.2014.710081.

[3] L.D. Sharma and R.K. Sunkaria. "Inferior myocardial infarction detection using stationary wavelet transform and machine learning approach." In: Signal, Image and Video Processing (July 2017). doi: https://doi.org/10.1007/s11760-017-1146-z.

[4] Perol Thibaut, Gharbi Michaël, and Denolle Marin. "Convolutional neural network for earthquake detection and location". In: Science Advances (Feb. 2018). doi: 10.1126/sciadv.1700578

Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms

2020-11-29T23:40:07Z

D287zhan: /* Introduction */

== Presented by ==

Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee

== Introduction ==

This paper presents an approach to detecting heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. For context, ConvNetQuake is a convolutional neural network that has been used in the past for Earthquake detection and location [4]. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to healthcare innovation, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.

== Previous Work and Motivation ==

The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo and Attarodi [2] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [3] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [1] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained.

The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.

== Model Architecture ==

The dataset, which was used to train, validate, and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.

This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.

During the training stage, a 10-second long two-channel input was fed into the neural network. In order to ensure that the two channels were weighted equally, both channels were normalized. Besides, time invariance was incorporated by selecting the 10-second long segment randomly from the entire signal.

The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.

This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.

During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.

[[File:architecture.png | thumb | center | 1000px | Model Architecture (Gupta et al., 2019)]]

==Result==

The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting in the highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of the top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.

Next, they discussed 2 factors affecting model performance evaluation: 1） Random train-val-test split might have effects on the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of the neural network shows little effects on the performance of the model performance evaluation, because of showing high average results with a fixed train-val-test split.

Comparing with other models in the other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records affecting the training accuracy, they used 290 fold patient-wise split, resulting in the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result in lower accuracy evaluation, however, it still maintains a high average of 97.83%.

==Conclusion & Discussion==

The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to a larger labeled data set is needed. The PTB database is small. It is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.

A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.

Fourier Transform(such as FFT) can be helpful when dealing with ECG signals. It transforms signals from time domain to frequency domain, which means some hidden features in frequency may be discovered.

==Critiques==
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.

- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.

- I feel that the results of deep learning models in medical settings where the consequences of misclassification can be severe should be evaluated by assigning weights to classification. In case if the misclassification can lead to severe consequences, then the network should be trained in such a way that it errs towards safety. For example, in case if heart disease, the consequences will be very high if the system says that there is no heart disease when in fact there is. So, the evaluation metric must be selected carefully.

- This is a useful and meaningful application topic in machine learning. Using Deep Learning to detect heart disease can be very helpful if it is difficult to detect disease by looking at ECG by humans eys. This model also useful for doing statistics, such as calculating the percentage of people get heart disease. But I think the doctor should not 100% trust the result from the model, it is almost impossible to get 100% accuracy from a model. So, I think double-checking by human eyes is necessary if the result is weird. What is more, I think it will be interesting to discuss more applications in mediccal by using this method, such as detecting the Brainwave diagram to predict a person's mood and to diagnose mental diseases.

- Compared to the dataset for other topics such as object recognition, the PTB database is pretty small with only 549 ECG records. And these are highly unbiased(Table 1) with 4 records for myocarditis and 148 for myocardial infarction. Medical datasets can only be labeled by specialists. This is why these datasets are related small. It would be great if there will be a larger, more comprehensive dataset.

== References ==

[1] Na Liu et al. "A Simple and Effective Method for Detecting Myocardial Infarction Based on Deep Convolutional Neural Network". In: Journal of Medical Imaging and Health Informatics (Sept. 2018). doi: 10.1166/jmihi.2018.2463.

[2] Naser Safdarian, N.J. Dabanloo, and Gholamreza Attarodi. "A New Pattern Recognition Method for Detection and Localization of Myocardial Infarction Using T-Wave Integral and Total Integral as Extracted Features from One Cycle of ECG Signal". In: J. Biomedical Science and Engineering (Aug. 2014). doi: http://dx.doi.org/10.4236/jbise.2014.710081.

[3] L.D. Sharma and R.K. Sunkaria. "Inferior myocardial infarction detection using stationary wavelet transform and machine learning approach." In: Signal, Image and Video Processing (July 2017). doi: https://doi.org/10.1007/s11760-017-1146-z.

Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms

2020-11-29T23:39:54Z

D287zhan: /* Introduction */

== Presented by ==

Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee

== Introduction ==

This paper presents an approach to detecting heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. For context, ConvNetQuake is a convolutional neural network that has been used in the past for Earthquake detection and location. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to healthcare innovation, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.

== Previous Work and Motivation ==

The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo and Attarodi [2] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [3] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [1] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained.

The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.

== Model Architecture ==

The dataset, which was used to train, validate, and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.

This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.

During the training stage, a 10-second long two-channel input was fed into the neural network. In order to ensure that the two channels were weighted equally, both channels were normalized. Besides, time invariance was incorporated by selecting the 10-second long segment randomly from the entire signal.

The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.

This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.

During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.

[[File:architecture.png | thumb | center | 1000px | Model Architecture (Gupta et al., 2019)]]

==Result==

The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting in the highest individual accuracies: v5, v6, vx, vz, and ii. The researcher further investigated the accuracies for pairs of the top 5 highest individual channels using 20-fold cross-validation. The arrived at the conclusion of highest pairs accuracies to fed into a neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly.

Next, they discussed 2 factors affecting model performance evaluation: 1） Random train-val-test split might have effects on the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2） random initialization of the weights of the neural network shows little effects on the performance of the model performance evaluation, because of showing high average results with a fixed train-val-test split.

Comparing with other models in the other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records affecting the training accuracy, they used 290 fold patient-wise split, resulting in the same highest accuracy of the pair v6 and vz same as record-wise split. Even though the patient-wise split might result in lower accuracy evaluation, however, it still maintains a high average of 97.83%.

==Conclusion & Discussion==

The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to a larger labeled data set is needed. The PTB database is small. It is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.

A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.

Fourier Transform(such as FFT) can be helpful when dealing with ECG signals. It transforms signals from time domain to frequency domain, which means some hidden features in frequency may be discovered.

==Critiques==
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.

- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.

- I feel that the results of deep learning models in medical settings where the consequences of misclassification can be severe should be evaluated by assigning weights to classification. In case if the misclassification can lead to severe consequences, then the network should be trained in such a way that it errs towards safety. For example, in case if heart disease, the consequences will be very high if the system says that there is no heart disease when in fact there is. So, the evaluation metric must be selected carefully.

- This is a useful and meaningful application topic in machine learning. Using Deep Learning to detect heart disease can be very helpful if it is difficult to detect disease by looking at ECG by humans eys. This model also useful for doing statistics, such as calculating the percentage of people get heart disease. But I think the doctor should not 100% trust the result from the model, it is almost impossible to get 100% accuracy from a model. So, I think double-checking by human eyes is necessary if the result is weird. What is more, I think it will be interesting to discuss more applications in mediccal by using this method, such as detecting the Brainwave diagram to predict a person's mood and to diagnose mental diseases.

- Compared to the dataset for other topics such as object recognition, the PTB database is pretty small with only 549 ECG records. And these are highly unbiased(Table 1) with 4 records for myocarditis and 148 for myocardial infarction. Medical datasets can only be labeled by specialists. This is why these datasets are related small. It would be great if there will be a larger, more comprehensive dataset.

== References ==

[1] Na Liu et al. "A Simple and Effective Method for Detecting Myocardial Infarction Based on Deep Convolutional Neural Network". In: Journal of Medical Imaging and Health Informatics (Sept. 2018). doi: 10.1166/jmihi.2018.2463.

[2] Naser Safdarian, N.J. Dabanloo, and Gholamreza Attarodi. "A New Pattern Recognition Method for Detection and Localization of Myocardial Infarction Using T-Wave Integral and Total Integral as Extracted Features from One Cycle of ECG Signal". In: J. Biomedical Science and Engineering (Aug. 2014). doi: http://dx.doi.org/10.4236/jbise.2014.710081.

[3] L.D. Sharma and R.K. Sunkaria. "Inferior myocardial infarction detection using stationary wavelet transform and machine learning approach." In: Signal, Image and Video Processing (July 2017). doi: https://doi.org/10.1007/s11760-017-1146-z.

Music Recommender System Based using CRNN

2020-11-28T22:56:08Z

D287zhan: /* Introduction and Objective: */

==Introduction and Objective:==

In the digital era of music streaming, companies, such as Spotify and Pandora, are faced with the following challenge: how to provide users with relevant and personalized music recommendations amidst the ever-growing abundance of music and user data.

The objective of this paper is to implement a personalized music recommender system that takes user listening history as input and continually finds new music that captures individual user preferences.

The authors of this paper argue that a music recommendation system should vary from the general recommendation system used in practice since it should combine music feature recognition and audio processing technologies to extract music features, and combine them with data on user preferences.

The authors of this paper took a content-based music approach to build the recommendation system - specifically, comparing the similarity of features based on the audio signal.

The following two-method approach to building the recommendation system was followed:
#Make recommendations including genre information extracted from classification algorithms.
#Make recommendations without genre information.

The authors used convolutional recurrent neural networks (CRNN), which is a combination of convolutional neural networks (CNN) and recurrent neural network(RNN), as their main classification model.

==Methods and Techniques:==

The original music’s audio signal is converted into a spectrogram image. Using the image and the Short Time Fourier Transform (STFT), we convert the data into the Mel scale which is used in the CNN and CRNN models.
=== Mel Scale: ===
Scale of pitches that are heard by listeners, which translates to equal pitch increments.

[[File:Mel.png|frame|none|Mel Scale on Spectrogram]]

=== Short Time Fourier Transform (STFT): ===
Transformation that determines the sinusoidal frequency of the audio, with a Hanning smoothing function.
=== Convolutional Neural Network (CNN): ===
Neural Network that uses convolution in place of matrix multiplication for some layer calculations. By training the data, weights for inputs are updated to find the most significant data relevant to classification. These convolutional layers gather small groups of data and with kernels, and try to find patterns that can help find features in the overall data. The features are then used for classification. Padding is also used to maintain the data on the edges.

[[File:Convolution.png|thumb|400px|left|Convolution Operation]]
[[File:PaddingKernels.png|thumb|400px|center|Example of Padding (white 0s) and Kernels (blue square)]]

=== Convolutional Recurrent Neural Network (CRNN): ===
Similar Neural Network as CNN, with the addition of a GRU, which is a Recurrent Neural Network (RNN). A RNN is used to treat sequential data, by reusing the activation function of previous nodes to update the output. A Gated Recurrent Unit (GRU) is used to store more long-term memory and will help train the early hidden layers.

[[File:GRU441.png|thumb|400px|left|Gated Recurrent Unit (GRU)]]
[[File:Recurrent441.png|thumb|400px|center|Diagram of General Recurrent Neural Network]]

==Data Screening:==

The authors of this paper used a publicly available music dataset made up of 25,000 30 second songs from the Free Music Archives. To ensure a balanced dataset, only 1000 songs each from the genres of classical, electronic, folk, hip-hop, instrumental, jazz and rock were used in the final model.

[[File:Data441.png|thumb|200px|none|Data sorted by music genre]]

==Implementation:==

=== Modeling Neural Networks ===

As noted previously, both CNNs and CRNNs were used to model the data. The advantage of CRNNs is that they are able to model time sequence patterns in addition to frequency features from the spectrogram, allowing for greater identification of important features. Furthermore, feature vectors produced before the classification stage could be used to improve accuracy.

In implementing the neural networks, the Mel-spectrogram data was split up into training, validation, and test sets at a ratio of 8:1:1 respectively and labelled via one-hot encoding. This made it possible for the categorical data to be labelled correctly for binary classification. As opposed to classical stochastic gradient descent, the authors opted to use Adam optimization to update weights in the training phase. Binary cross-entropy was used as the loss function.

In both the CNN and CRNN models, the data was trained over 100 epochs with a binary cross-entropy loss function. The sigmoid function was used as the output layer.

An overview of the CNN and CRNN architecture can be found in the charts below.

[[File:CNN441.png|thumb|800px|none|Implementation of CNN Model]]
[[File:CRNN441.png|thumb|800px|none|Implementation of CRNN Model]]

=== Music Recommendation System ===

The recommendation system is computed by the cosine similarity of the extraction features from the neural network. Each genre will have a song act as a centre point for each class. The final inputs of the trained neural networks will be the feature variables. The featured variables will be used in the cosine similarity to find the best recommendations.

The values are between [-1,1], where larger values are songs that have similar features. When the user inputs five songs, those songs become the new inputs in the neural networks and the features are used by the cosine similarity with other music. The largest five cosine similarities are used as recommendations.
[[File:Cosine441.png|frame|100px|none|Cosine Similarity]]

== Evaluation Metrics ==
=== Precision: ===
* The proportion of True Positives with respect to the '''predicted''' positive cases (true positives and false positives)
* For example, out of all the songs that the classifier '''predicted''' as Classical, how many are actually Classical?
* Describes the rate at which the classifier predicts the true genre of songs among those predicted to be of that certain genre

=== Recall: ===
* The proportion of True Positives with respect to the '''actual''' positive cases (true positives and false negatives)
* For example, out of all the songs that are '''actually''' Classical, how many are correctly predicted to be Classical?
* Describes the rate at which the classifier predicts the true genre of songs among the correct instances of that genre

=== F1-Score: ===
An accuracy metric that combines the classifier’s precision and recall scores by taking the harmonic mean between the two metrics:

[[File:F1441.png|frame|100px|none|F1-Score]]

=== Receiver operating characteristics (ROC): ===
* A graphical metric that is used to assess a classification model at different classification thresholds
* In the case of a classification threshold of 0.5, this means that if <math>P(Y = k | X = x) > 0.5</math> then we classify this instance as class k
* Plots the true positive rate versus false positive rate as the classification threshold is varied

[[File:ROCGraph.jpg|thumb|400px|none|ROC Graph. Comparison of True Positive Rate and False Positive Rate]]

=== Area Under the Curve (AUC) ===
AUC is the area under the ROC in doing so, the ROC provides an aggregate measure across all possible classification thresholds.

In the context of the paper: When scoring all songs as <math>Prob(Classical | X=x)</math>, it is the probability that the model ranks a random Classical song at a higher probability than a random non-Classical song.

[[File:AUCGraph.jpg|thumb|400px|none|Area under the ROC curve.]]

== Results ==
=== Accuracy Metrics ===
Looking at the accuracy metrics at the classification threshold of 0.5:

[[File:TruePositiveChart.jpg|thumb|none|True Positive / False Positive Chart]]
On average, CRNN outperforms CNN in true positive and false positive cases

[[File:F1Chart441.jpg|thumb|400px|none|F1 Chart]]
On average, CRNN outperforms CNN in F1-score

[[File:AUCChart.jpg|thumb|400px|none|AUC Chart]]
On average, CRNN also outperforms CNN in AUC metric

CRNN models that considers the frequency features and time sequence patterns of songs have a better classification performance through metrics such as F1 score and AUC when comparing to CNN classifier.

=== Evaluation of Music Recommendation System: ===

* A listening experiment was performed with 30 participants to access user responses to given music recommendations.
* Participants choose 5 preferred music and the recommender system gives 5 recommendations; the participants evaluated the music recommendation by recording whether the song was liked or disliked.
* The recommendation system takes two approach to recommendation:
** Method one uses only the value of cosine similarity
** Method two uses the value of cosine similarity and information on music genre
*Perform test of significance of differences in respondents to the two methods using t-statistic
[[File:H0441.png|frame|100px|none|Hypothesis test between method 1 and method 2]]

Comparing the two methods, <math> H_0: u_1 - u_2 = 0</math>, we have <math> t_{stat} = -4.743 < -2.037 </math> which concludes that the addition of a music genre information increases -- it is statistically significant

== Conclusion: ==

To increase the predictive capabilities of the music recommendation system, song genre should be a key feature.
To extract the song genre from a song’s audio signals, CRNN’s are superior to CNN’s as they consider frequency in features and time sequence patterns of audio signals.

== Critiques/ Insights: ==
#The authors fail to give reference to the performance of current recommendation algorithms used in industry; my critique would be for the authors to bench-mark their novel approach with other recommendation algorithms such as collaborative filtering to see if there is lift in predictive capabilities.
#The listening experiment used to evaluate the recommendation system only includes songs that are outputted by the model. Users may be biased if they believe all songs have come from a recommendation system. To remove bias, we suggest to have 15 songs where 5 songs are recommended and 10 songs are set. With this in the user’s mind it may remove some bias in response and give more accurate predictive capabilities.
#They could go into more details about how CRNN makes it perform better than CNN, in terms of attributes of each network.
# The methodology introduced in this paper is probably also suitable for movie recommendation. As music is presented as spectrograms (images) in a time sequence, and it is very similar to a movie.
# The way of evaluation is a very interesting approach. Since it's usually not easy to evaluate the testing result when it's subjective. By listing all these evaluations' performance, the result would be more comprehensive.
# The paper lacks the comparison between the proposed algorithm and the music recommendation algorithms being used now. It will be clearer to show the superiority of this algorithm.
# The GAN neural network has been proposed to enhance the performance of the neural network, so an improved result may appear after considering using GAN.

Surround Vehicle Motion Prediction

2020-11-28T22:50:26Z

D287zhan: /* Network architecture */

DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability

== Previous Work ==
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on urban road, there are 3 categories for motion prediction model: (1) physics-based (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which often used in offline simulations.

== Motivation ==
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples and 4,998 evaluation data samples.

=== Motion predictor ===
This article propose a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading and speed.
==== Encoder and decoder ====
In this study, authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit.
==== Squence length ====
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\
&||\mu(k+1|t) - \mu(k|t)|| \leq S
\end{split}
\end{equation*}
Determine the position and speed boundary based on the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with the results from three base algorithms, a path-following model with
constant velocity, a path-following model with traffic flow and a CTRV model.

We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>,
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}
\end{split}
\end{equation*}

The proposed model shows a significantly less prediction errors compare to the based algorithms in terms of mean,
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell shaped
cure with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers'
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better, by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle, even when the target vehicle was not following the intersection guideline.

==== Statistical analysis of motion planning application results ====
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance, and had little influence on determining
the behavior of the subject vehicle.

In order to compare the similarities between the results form the proposed algorithm and human driving decisions,
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, base algorithm
possesses limited ability to respond to different in-lane target behaviors in traffic flow. Hence, the proposed
model is efficient and safety.

== Conclusion ==
A surrounding vehicle motion predictor based on a LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with other three base algorithms (CV/Path, V_flow/Path and CTRV) revealed the superiority of the proposed algorithm.

== Future works ==
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of road. Why the LSTM-RNN is used, and the background of method are not stated clearly. There is lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.

This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.

Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance on different algorithms or some other traditional motion planning algorithms like KF.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

Surround Vehicle Motion Prediction

2020-11-28T22:47:47Z

D287zhan: /* Case study of a multi-lane left turn scenario */

DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections
== Presented by ==
Mushi Wang, Siyuan Qiu, Yan Yu

== Introduction ==

This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability

== Previous Work ==
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on urban road, there are 3 categories for motion prediction model: (1) physics-based (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which often used in offline simulations.

== Motivation ==
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.

== Framework ==
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection.

<center>[[Image:Figure1_Yan.png|800px|]]</center>

== LSTM-RNN based motion predictor ==

=== Data ===
The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples and 4,998 evaluation data samples.

=== Motion predictor ===
This article propose a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view.

<center>[[Image:Figure7b_Yan.png|500px|]]</center>

==== Network architecture ====
RNN is an artificial neural network, suitable for use with sequential data. RNN can also be used for time series data, where the pattern of the data depends on the time flow. It can contain feedback loops that allow activations to flow alternately in the loop.
LSTM can avoid the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network training improperly.The figure below shows the various layers of LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.

<center>[[Image:Figure8_Yan.png|800px|]]</center>

==== Input and output features ====
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading and speed.
==== Encoder and decoder ====
In this study, authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit.
==== Squence length ====
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.

== Motion planning based on surrounding vehicle motion prediction ==
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:
\begin{equation*}
\begin{split}
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2
\end{split}
\end{equation*}
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles.
The constraints of the control input are defined as follows:
\begin{equation*}
\begin{split}
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\
&||\mu(k+1|t) - \mu(k|t)|| \leq S
\end{split}
\end{equation*}
Determine the position and speed boundary based on the predicted state:
\begin{equation*}
\begin{split}
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0
\end{split}
\end{equation*}
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.

== Prediction performance analysis and application to motion planning ==
=== Accuracy analysis ===
The proposed algorithm was compared with the results from three base algorithms, a path-following model with
constant velocity, a path-following model with traffic flow and a CTRV model.

We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>,
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:

\begin{equation*}
\begin{split}
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}
\end{split}
\end{equation*}

The proposed model shows a significantly less prediction errors compare to the based algorithms in terms of mean,
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell shaped
cure with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers'
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore,
the proposed algorithm can be precise and maintain safety simultaneously.

=== Motion planning application ===
==== Case study of a multi-lane left turn scenario ====
The proposed method mimics a human driver better, by simulating a human driver's decision-making process.
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target
vehicle, even when the target vehicle was not following the intersection guideline.

==== Statistical analysis of motion planning application results ====
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means
that these cases took place sufficiently beyond the safety distance, and had little influence on determining
the behavior of the subject vehicle.

In order to compare the similarities between the results form the proposed algorithm and human driving decisions,
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math>
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm,
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, base algorithm
possesses limited ability to respond to different in-lane target behaviors in traffic flow. Hence, the proposed
model is efficient and safety.

== Conclusion ==
A surrounding vehicle motion predictor based on a LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with other three base algorithms (CV/Path, V_flow/Path and CTRV) revealed the superiority of the proposed algorithm.

== Future works ==
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.

2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.

3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.

4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.

== Critiques ==
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of road. Why the LSTM-RNN is used, and the background of method are not stated clearly. There is lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.

This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.

Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance on different algorithms or some other traditional motion planning algorithms like KF.

== Reference ==
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.

[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.

[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.

[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T22:28:00Z

D287zhan: /* Results */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg]]

Figure 1. '''Speech2Face model and training pipeline'''

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database [5] is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

Table 1: '''Voice encoder architecture'''

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

Figure 2: '''Qualitative results on the AVSpeech test set'''

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

Figure 3. '''Facial attribute evaluation'''

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

Table 2. '''Feature similarity'''

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and the true facial feature vector from the face decoder were computed, and presented, above, in Table 2. A comparison of facial similarity was also done based on the length of audio input. From the table, it is evident that the 6-second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

Table 3. '''S2F -> Face retrieval performance'''

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# There is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. Testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T22:25:46Z

D287zhan: /* Results */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg]]

Figure 1. '''Speech2Face model and training pipeline'''

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database [5] is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

Table 1: '''Voice encoder architecture'''

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

Figure 2: '''Qualitative results on the AVSpeech test set'''

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

Figure 3. '''Facial attribute evaluation'''

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

Table 2. '''Feature similarity'''

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and the true facial feature vector from the face decoder were computed, and presented, above, in Table 2. A comparison of facial similarity was also done based on the length of audio input. From the table, it is evident that the 6-second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

Table 3. '''S2F -> Face retrieval performance'''

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# There is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. Testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T22:20:25Z

D287zhan: /* Results */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg]]

Figure 1. '''Speech2Face model and training pipeline'''

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database [5] is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

Table 1: '''Voice encoder architecture'''

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

Figure 2: '''Qualitative results on the AVSpeech test set'''

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

Figure 3: '''Facial attribute evaluation'''

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# There is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. Testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T22:20:03Z

D287zhan: /* Results */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg]]

Figure 1. '''Speech2Face model and training pipeline'''

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database [5] is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

Table 1: '''Voice encoder architecture'''

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

Figure 2: '''Qualitative results on the AVSpeech test set'''

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

Figure 3: '''Facial attribute evaluation'''

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# There is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. Testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T22:18:25Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg]]

Figure 1. '''Speech2Face model and training pipeline'''

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

Table 1: '''Voice encoder architecture'''

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

Figure 2: '''Qualitative results on the AVSpeech test set'''

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# There is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. Testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T22:14:56Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg]]

Figure 1. '''Speech2Face model and training pipeline'''

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

Table 1: '''Voice encoder architecture'''

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# There is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. Testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T22:14:41Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg]]

Figure 1. '''Speech2Face model and training pipeline'''

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

Table 1: '''Voice encoder architecture'''

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# There is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. Testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T22:13:53Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg]]

Figure 1. '''Speech2Face model and training pipeline'''

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al [3] and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# There is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. Testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T22:12:29Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg]]

Figure 1. '''Speech2Face model and training pipeline'''

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). The image above * * * gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al (cite) and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al (cite), <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# There is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. Testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T22:12:19Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg]]

Figure 1. '''Speech2Face model and training pipeline'''

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). The image above * * * gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al (cite) and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al (cite), <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# There is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. Testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T22:11:24Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg]]

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). The image above * * * gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al (cite) and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function is used which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al (cite), <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

<center>
[[File:L1vsTotalLoss.png | 700px]]
</center>

== Results ==

'''Confusion Matrix and Dataset statistics'''

<center>
[[File:Confusionmatrix.png| 600px]]
</center>

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

'''Feature Similarity'''

<center>
[[File:FeatSim.JPG]]
</center>

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

<center>
[[File: Retrieval.JPG]]
</center>

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# There is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. Testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

File:ModelFrameworkupdated.jpg

2020-11-28T21:35:26Z

D287zhan:

Speech2Face: Learning the Face Behind a Voice

2020-11-28T21:35:05Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFrameworkupdated.jpg]]

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). The image above * * * gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al (cite) and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.
'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Face Decoder Architecture'''

The face decoder reconstructs the face from low-dimensional face features. Irrelevant variations like pose and lighting were factored out while still preserving the core facial features. To do this the face decoder built by Cole et al (cite) was used. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>. This provides the supervision signal for the voice-encoder. the feature <math>v_s</math> of the voice encoder is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. Let <math>v_s</math> be the 4096 dimensional facial feature vector from the voice encoder, and <math>v_f</math> be the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al (cite), <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

[[File:L1vsTotalLoss.png]]

From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

== Results ==

'''Confusion Matrix and Dataset statistics'''

[[File: ConfMat.JPG]]

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 0.12% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

* * Fix image, remove stats* *

'''Feature Similarity'''

[[File:FeatSim.JPG]]

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

[[File: Retrieval.JPG]]

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problme was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# Their is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T21:34:55Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFramework.jpg]]

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). The image above * * * gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al (cite) and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.
'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Face Decoder Architecture'''

The face decoder reconstructs the face from low-dimensional face features. Irrelevant variations like pose and lighting were factored out while still preserving the core facial features. To do this the face decoder built by Cole et al (cite) was used. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>. This provides the supervision signal for the voice-encoder. the feature <math>v_s</math> of the voice encoder is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. Let <math>v_s</math> be the 4096 dimensional facial feature vector from the voice encoder, and <math>v_f</math> be the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al (cite), <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

[[File:L1vsTotalLoss.png]]

From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

== Results ==

'''Confusion Matrix and Dataset statistics'''

[[File: ConfMat.JPG]]

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 0.12% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

* * Fix image, remove stats* *

'''Feature Similarity'''

[[File:FeatSim.JPG]]

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

[[File: Retrieval.JPG]]

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problme was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# Their is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

File:ModelFrameworknew.jpg

2020-11-28T21:32:51Z

D287zhan:

Speech2Face: Learning the Face Behind a Voice

2020-11-28T21:32:39Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

[[File:ModelFrameworknew.jpg]]

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). The image above * * * gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al (cite) and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.
'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Face Decoder Architecture'''

The face decoder reconstructs the face from low-dimensional face features. Irrelevant variations like pose and lighting were factored out while still preserving the core facial features. To do this the face decoder built by Cole et al (cite) was used. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>. This provides the supervision signal for the voice-encoder. the feature <math>v_s</math> of the voice encoder is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. Let <math>v_s</math> be the 4096 dimensional facial feature vector from the voice encoder, and <math>v_f</math> be the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al (cite), <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

[[File:L1vsTotalLoss.png]]

From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

== Results ==

'''Confusion Matrix and Dataset statistics'''

[[File: ConfMat.JPG]]

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 0.12% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

* * Fix image, remove stats* *

'''Feature Similarity'''

[[File:FeatSim.JPG]]

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

[[File: Retrieval.JPG]]

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problme was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# Their is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T21:29:59Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

[[File:ModelFramework.jpg]]

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). The image above * * * gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al (cite) and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Face Decoder Architecture'''

The face decoder reconstructs the face from low-dimensional face features. Irrelevant variations like pose and lighting were factored out while still preserving the core facial features. To do this the face decoder built by Cole et al (cite) was used. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>. This provides the supervision signal for the voice-encoder. the feature <math>v_s</math> of the voice encoder is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. Let <math>v_s</math> be the 4096 dimensional facial feature vector from the voice encoder, and <math>v_f</math> be the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al (cite), <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

[[File:L1vsTotalLoss.png]]

From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

== Results ==

'''Confusion Matrix and Dataset statistics'''

[[File: ConfMat.JPG]]

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 0.12% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

* * Fix image, remove stats* *

'''Feature Similarity'''

[[File:FeatSim.JPG]]

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

[[File: Retrieval.JPG]]

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problme was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# Their is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T21:27:03Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the deign and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

'''Figure 1:'''

[[File:ModelFramework.jpg]]

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). The image above * * * gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al (cite) and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Face Decoder Architecture'''

The face decoder reconstructs the face from low-dimensional face features. Irrelevant variations like pose and lighting were factored out while still preserving the core facial features. To do this the face decoder built by Cole et al (cite) was used. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair colour, may be predicted by the model.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>. This provides the supervision signal for the voice-encoder. the feature <math>v_s</math> of the voice encoder is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. Let <math>v_s</math> be the 4096 dimensional facial feature vector from the voice encoder, and <math>v_f</math> be the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al (cite), <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

[[File:L1vsTotalLoss.png]]

From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

== Results ==

'''Confusion Matrix and Dataset statistics'''

[[File: ConfMat.JPG]]

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 0.12% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

* * Fix image, remove stats* *

'''Feature Similarity'''

[[File:FeatSim.JPG]]

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

[[File: Retrieval.JPG]]

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problme was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# Their is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T21:26:48Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the deign and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

Figure 1:

[[File:ModelFramework.jpg]]

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). The image above * * * gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al (cite) and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Face Decoder Architecture'''

The face decoder reconstructs the face from low-dimensional face features. Irrelevant variations like pose and lighting were factored out while still preserving the core facial features. To do this the face decoder built by Cole et al (cite) was used. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is a composed of educational videos and does not provide an accurate representation of the global population, which will affect the model. Also note that facial features that are irrelevant to speech, like hair colour may be predicted by the model, if that feature is common to people who speak in a similar way.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>. This provides the supervision signal for the voice-encoder. the feature <math>v_s</math> of the voice encoder is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. Let <math>v_s</math> be the 4096 dimensional facial feature vector from the voice encoder, and <math>v_f</math> be the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al (cite), <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

[[File:L1vsTotalLoss.png]]

From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

== Results ==

'''Confusion Matrix and Dataset statistics'''

[[File: ConfMat.JPG]]

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 0.12% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

* * Fix image, remove stats* *

'''Feature Similarity'''

[[File:FeatSim.JPG]]

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

[[File: Retrieval.JPG]]

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problme was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# Their is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T21:26:39Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the deign and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

Figure 1:

[[File:ModelFramework.jpg]]

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). The image above * * * gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al (cite) and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Face Decoder Architecture'''

The face decoder reconstructs the face from low-dimensional face features. Irrelevant variations like pose and lighting were factored out while still preserving the core facial features. To do this the face decoder built by Cole et al (cite) was used. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is a composed of educational videos and does not provide an accurate representation of the global population, which will affect the model. Also note that facial features that are irrelevant to speech, like hair colour may be predicted by the model, if that feature is common to people who speak in a similar way.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>. This provides the supervision signal for the voice-encoder. the feature <math>v_s</math> of the voice encoder is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. Let <math>v_s</math> be the 4096 dimensional facial feature vector from the voice encoder, and <math>v_f</math> be the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al (cite), <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

[[File:L1vsTotalLoss.png]]

From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

== Results ==

'''Confusion Matrix and Dataset statistics'''

[[File: ConfMat.JPG]]

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 0.12% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

* * Fix image, remove stats* *

'''Feature Similarity'''

[[File:FeatSim.JPG]]

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

[[File: Retrieval.JPG]]

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problme was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# Their is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T21:26:22Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the deign and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

Figure 1:
[[File:ModelFramework.jpg]]

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). The image above * * * gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al (cite) and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Face Decoder Architecture'''

The face decoder reconstructs the face from low-dimensional face features. Irrelevant variations like pose and lighting were factored out while still preserving the core facial features. To do this the face decoder built by Cole et al (cite) was used. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is a composed of educational videos and does not provide an accurate representation of the global population, which will affect the model. Also note that facial features that are irrelevant to speech, like hair colour may be predicted by the model, if that feature is common to people who speak in a similar way.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>. This provides the supervision signal for the voice-encoder. the feature <math>v_s</math> of the voice encoder is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. Let <math>v_s</math> be the 4096 dimensional facial feature vector from the voice encoder, and <math>v_f</math> be the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al (cite), <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

[[File:L1vsTotalLoss.png]]

From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

== Results ==

'''Confusion Matrix and Dataset statistics'''

[[File: ConfMat.JPG]]

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 0.12% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

* * Fix image, remove stats* *

'''Feature Similarity'''

[[File:FeatSim.JPG]]

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

[[File: Retrieval.JPG]]

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problme was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# Their is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.

Speech2Face: Learning the Face Behind a Voice

2020-11-28T21:26:01Z

D287zhan: /* Model Architecture */

== Presented by ==
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang

== Introduction ==
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model learns the correlations, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.

== Previous Work ==
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, to predict lip motion from speech and even learn the emotion of the agents based on their voices.

Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead hopes to recover the dominant and generic facial structure from a speech.

== Motivation ==
Often, when people listen to a person speaking, without seeing his/her face, whether it is on the phone or on the radio, they build a mental image in their head for what we think that person may look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.

== Model Architecture ==

'''Speech2Face model and training pipeline'''

The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the deign and training of the Speech2Face model. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a largescale face database (cite) is used to extract a 4069-D face feature from the penultimate layer of the network.

[[File:ModelFramework.jpg]]
Figure 1:

The Speech2Face Model used to achieve the desired result consist of 2 parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). The image above * * * gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al (cite) and will not be explored in great detail here, but in essence the facenet model (cite) is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The two results are combined to form an image.

'''Voice Encoder Architecture'''

[[File:VoiceEncoderArch.JPG]]

The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture given above * * *. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time, and allows the model to be used for input speeches of varying length. Two fully connected layers at the end are used to return a 4096 dimensional facial feature output.

'''Face Decoder Architecture'''

The face decoder reconstructs the face from low-dimensional face features. Irrelevant variations like pose and lighting were factored out while still preserving the core facial features. To do this the face decoder built by Cole et al (cite) was used. This model was trained using the VGG-Face model as input. It was also trained separately and remained fixed during the voice encoder training.

'''Training'''

The AVSSpeech dataset, a large scale audio-visual dataset is used for the training. AVSSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is a composed of educational videos and does not provide an accurate representation of the global population, which will affect the model. Also note that facial features that are irrelevant to speech, like hair colour may be predicted by the model, if that feature is common to people who speak in a similar way.

The voice encoder is trained in a self-supervised manner. A frame which contains the face is extracted from each video and then inputed to the VGG-Face model to extract the feature vector <math>v_f</math>. This provides the supervision signal for the voice-encoder. the feature <math>v_s</math> of the voice encoder is trained to predict <math>v_f</math>.

In order to train this model, a proper loss function must be defined. Let <math>v_s</math> be the 4096 dimensional facial feature vector from the voice encoder, and <math>v_f</math> be the 4096 dimensional facial feature vector given by the face decoder on a single frame from the input video. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. The image below shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss * * *. Based on the work of Castrejon et al. (cite), a loss function which penalizes the differences in the last layer of the face decoder <math>f_{VGG}</math> and the first layer <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}logp_{(i)}(b)$$ $$p_{(i)}(a) = \frac{exp(a_i/T)}{\sum_jexp(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al (cite), <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.

[[File:L1vsTotalLoss.png]]

From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.

== Results ==

'''Confusion Matrix and Dataset statistics'''

[[File: ConfMat.JPG]]

In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. The following image * * * gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 0.12% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or asian face.

* * Fix image, remove stats* *

'''Feature Similarity'''

[[File:FeatSim.JPG]]

Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and true facial feature vector from the face decoder were computed, and presented above * * *. A comparison of facial similarity was also done based on the length of audio inputted. From the table, it is evident that the 6 second audio produced a lower cosine, L1, and L2, resulting in a facial feature vector that is closer to the ground truth.

'''S2f -> Face retrieval performance'''

[[File: Retrieval.JPG]]

The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, was developed in which the K closest images in distance to the output of the model are found, and the chance that the original image is within those K images is the R@K score. A higher R@K score indicates better performance. From the table, both the 3 second and 6 second audio showed significant improvement over random chance, with the 6 second audio performing slightly better.

== Conclusion ==
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problme was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.

== Discussion and Critiques ==

# Their is evidence that the results of the model may be heavily influenced by external factors. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the models prediction of ethnicity towards white. The bias in the results show that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Also the model was shown to infer different faces features based on language. This puts into question how heavily the model depends on the spoken language. testing a more controlled sample where all speech recording were of the same language may help address this concern to determine the models reliance on spoken language. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face.

== References ==
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In
IEEE International Conference on Computer Vision (ICCV),
2017.

[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2019.