http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=T34sun&feedformat=atomstatwiki - User contributions [US]2024-03-29T04:42:57ZUser contributionsMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Research_Papers_Classification_System&diff=48939Research Papers Classification System2020-12-02T22:20:23Z<p>T34sun: /* Paper Classification Using K-means Clustering */</p>
<hr />
<div>= Presented by =<br />
Jill Wang, Junyi (Jay) Yang, Yu Min (Chris) Wu, Chun Kit (Calvin) Li<br />
<br />
= Introduction =<br />
This paper introduces a paper classification system that utilizes the Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and K-means clustering. The most important technology the system used to process big data is the Hadoop Distributed File System (HDFS). The system can handle quantitatively complex research paper classification problems efficiently and accurately.<br />
<br />
===General Framework===<br />
<br />
The paper classification system classifies research papers based on the abstracts given that the core of most papers is presented in the abstracts. <br />
<br />
<ol><li>Paper Crawling <br />
<p>Collects abstracts from research papers published during a given period</p></li><br />
<li>Preprocessing<br />
<p> <ol style="list-style-type:lower-alpha"><li>Removes stop words in the papers crawled, in which only nouns are extracted from the papers</li><br />
<li>generates a keyword dictionary, keeping only the top-N keywords with the highest frequencies</li> </ol><br />
</p></li> <br />
<li>Topic Modelling<br />
<p> Use the LDA to group the keywords into topics</p><br />
</li><br />
<li>Paper Length Calculation<br />
<p> Calculates the total number of occurrences of words to prevent an unbalanced TF values caused by the various length of abstracts using the map-reduce algorithm</p><br />
</li><br />
<li>Word Frequency Calculation<br />
<p> Calculates the Term Frequency (TF) values which represent the frequency of keywords in a research paper</p><br />
</li><br />
<li>Document Frequency Calculation<br />
<p> Calculates the Document Frequency (DF) values which represents the frequency of keywords in a collection of research papers. The higher the DF value, the lower the importance of a keyword.</p><br />
</li><br />
<li>TF-IDF calculation<br />
<p> Calculates the inverse of the DF which represents the importance of a keyword.</p><br />
</li><br />
<li>Paper Classification<br />
<p> Classify papers by topics using the K-means clustering algorithm.</p><br />
</li><br />
</ol><br />
<br />
===Technologies===<br />
<br />
The HDFS with a Hadoop cluster composed of one master node, one sub node, and four data nodes is what is used to process the massive paper data. Hadoop-2.6.5 version in Java is what is used to perform the TF-IDF calculation. Spark MLlib is what is used to perform the LDA. The Scikit-learn library is what is used to perform the K-means clustering.<br />
<br />
===HDFS===<br />
<br />
Hadoop Distributed File System was used to process big data in this system. What Hadoop does is to break a big collection of data into different partitions and pass each partition to one individual processor. Each processor will only have information about the partition of data it received.<br />
<br />
'''In this summary, we are going to focus on introducing the main algorithms of what this system uses, namely LDA, TF-IDF, and K-Means.'''<br />
<br />
=Data Preprocessing=<br />
===Crawling of Abstract Data===<br />
<br />
Under the assumption that audiences tend to first read the abstract of a paper to gain an overall understanding of the material, it is reasonable to assume the abstract section includes “core words” that can be used to effectively classify a paper's subject.<br />
<br />
An abstract is crawled to have its stop words removed. Stop words are words that are usually ignored by search engines, such as “the”, “a”, and etc. Afterwards, nouns are extracted, as a more condensed representation for efficient analysis.<br />
<br />
This is managed on HDFS. The TF-IDF value of each paper is calculated through map-reduce.<br />
<br />
===Managing Paper Data===<br />
<br />
To construct an effective keyword dictionary using abstract data and keywords data in all of the crawled papers, the authors categorized keywords with similar meanings using a single representative keyword. The approach is called stemming, which is common in cleaning data. 1394 keyword categories are extracted, which is still too much to compute. Hence, only the top 30 keyword categories are used.<br />
<br />
<div align="center">[[File:table_1_kswf.JPG|700px]]</div><br />
<br />
=Topic Modeling Using LDA=<br />
<br />
Latent Dirichlet allocation (LDA) is a generative probabilistic model that views documents as random mixtures over latent topics. Each topic is a distribution over words, and the goal is to extract these topics from documents.<br />
<br />
LDA estimates the topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> using Dirichlet priors for the distributions with a fixed number of topics. For each document, obtain a feature vector:<br />
<br />
\[F = \left( P\left(z_1 | d\right), P\left(z_2 | d\right), \cdots, P\left(z_k | d\right) \right)\]<br />
<br />
In the paper, authors extract topics from preprocessed paper to generate three kinds of topic sets, each with 10, 20, and 30 topics respectively. The following is a table of the 10 topic sets of highest frequency keywords.<br />
<br />
<div align="center">[[File:table_2_tswtebls.JPG|700px]]</div><br />
<br />
<br />
===LDA Intuition===<br />
<br />
LDA uses the Dirichlet priors of the Dirichlet distribution. The following picture illustrates 2-simplex Dirichlet distributions with different alpha values, one for each corner of the triangles. <br />
<br />
<div align="center">[[File:dirichlet_dist.png|700px]]</div><br />
<br />
Simplex is a generalization of the notion of a triangle. In Dirichlet distribution, each parameter will be represented by a corner in simplex, so adding additional parameters implies increasing the dimensions of simplex. As illustrated, when alphas are smaller than 1 the distribution is dense at the corners. When the alphas are greater than 1 the distribution is dense at the centers.<br />
<br />
The following illustration shows an example LDA with 3 topics, 4 words and 7 documents.<br />
<br />
<div align="center">[[File:LDA_example.png|800px]]</div><br />
<br />
In the left diagram, there are three topics, hence it is a 2-simplex. In the right diagram there are four words, hence it is a 3-simplex. LDA essentially adjusts parameters in Dirichlet distributions and multinomial distributions (represented by the points), such that, in the left diagram, all the yellow points representing documents and, in the right diagram, all the points representing topics, are as close to a corner as possible. In other words, LDA finds topics for documents and also finds words for topics. At the end topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> are produced.<br />
<br />
=Term Frequency Inverse Document Frequency (TF-IDF) Calculation=<br />
<br />
TF-IDF is widely used to evaluate the importance of a set of words in the fields of information retrieval and text mining. It is a combination of term frequency (TF) and inverse document frequency (IDF). The idea behind this combination is<br />
* It evaluates the importance of a word within a document, and<br />
* It evaluates the importance of the word among the collection of all documents<br />
<br />
The TF-IDF formula has the following form:<br />
<br />
\[TF-IDF_{i,j} = TF_{i,j} \times IDF_{i}\]<br />
<br />
where i stands for the <math>i^{th}</math> word and j stands for the <math>j^{th}</math> document.<br />
<br />
===Term Frequency (TF)===<br />
<br />
TF evaluates the percentage of a given word in a document. Thus, TF value indicates the importance of a word. The TF has a positive relation with the importance.<br />
<br />
In this paper, we only calculate TF for words in the keyword dictionary obtained. For a given keyword i, <math>TF_{i,j}</math> is the number of times word i appears in document j divided by the total number of words in document j.<br />
<br />
The formula for TF has the following form:<br />
<br />
\[TF_{i,j} = \frac{n_{i,j} }{\sum_k n_{k,j} }\]<br />
<br />
where i stands for the <math>i^{th}</math> word, j stands for the <math>j^{th}</math> document, <math>n_{i,j}</math> stands for the number of times words <math>t_i</math> appear in document <math>d_j</math> and <math>\sum_k n_{k,j} </math> stands for total number of occurence of words in document <math>d_j</math>.<br />
<br />
Note that the denominator is the total number of words remaining in document j after crawling.<br />
<br />
===Document Frequency (DF)===<br />
<br />
DF evaluates the percentage of documents that contain a given word over the entire collection of documents. Thus, the higher DF value is, the less important the word is.<br />
<br />
<math>DF_{i}</math> is the number of documents in the collection with word i divided by the total number of documents in the collection. The formula for DF has the following form:<br />
<br />
\[DF_{i} = \frac{|d_k \in D: n_{i,k} > 0|}{|D|}\]<br />
<br />
where <math>n_{i,k}</math> is the number of times word i appears in document k, |D| is the total number of documents in the collection.<br />
<br />
Since DF and the importance of the word have an inverse relation, we use inverse document frequency (IDF) instead of DF.<br />
<br />
===Inverse Document Frequency (IDF)===<br />
<br />
In this paper, IDF is calculated in a log scale. Since we will receive a large number of documents, i.e, we will have a large |D|<br />
<br />
The formula for IDF has the following form:<br />
<br />
\[IDF_{i} = log\left(\frac{|D|}{|\{d_k \in D: n_{i,k} > 0\}|}\right)\]<br />
<br />
As mentioned before, we will use HDFS. The actual formula applied is:<br />
<br />
\[IDF_{i} = log\left(\frac{|D|+1}{|\{d_k \in D: n_{i,k} > 0\}|+1}\right)\]<br />
<br />
The inverse document frequency gives a measure of how rare a certain term is in a given document corpus.<br />
<br />
=Paper Classification Using K-means Clustering=<br />
<br />
The K-means clustering is an unsupervised classification algorithm that groups similar data into the same class. It is an efficient and simple method that can work with different types of data attributes and is able to handle noise and outliers.<br />
<br><br />
<br />
Given a set of <math>d</math> by <math>n</math> dataset <math>\mathbf{X} = \left[ \mathbf{x}_1 \cdots \mathbf{x}_n \right]</math>, the algorithm will assign each <math>\mathbf{x}_j</math> into <math>k</math> different clusters based on the characteristics of <math>\mathbf{x}_j</math> itself.<br />
<br><br />
<br />
Moreover, when assigning data into a cluster, the algorithm will also try to minimise the distances between the data and the centre of the cluster which the data belongs to. That is, k-means clustering will minimize the sum of square error:<br />
<br />
\begin{align*}<br />
min \sum_{i=1}^{k} \sum_{j \in C_i} ||x_j - \mu_i||^2<br />
\end{align*}<br />
<br />
where<br />
<ul><br />
<li><math>k</math>: the number of clusters</li><br />
<li><math>C_i</math>: the <math>i^th</math> cluster</li><br />
<li><math>x_j</math>: the <math>j^th</math> data in the <math>C_i</math></li><br />
<li><math>mu_i</math>: the centroid of <math>C_i</math></li><br />
<li><math>||x_j - \mu_i||^2</math>: the Euclidean distance between <math>x_j</math> and <math>\mu_i</math></li><br />
</ul><br />
<br><br />
<br />
Since the goal for this paper is to classify research papers and group papers with similar topics based on keywords, the paper uses the K-means clustering algorithm. The algorithm first computes the cluster centroid for each group of papers with a specific topic. Then, it will assign a paper into a cluster based on the Euclidean distance between the cluster centroid and the paper’s TF-IDF value.<br />
<br><br />
<br />
However, different values of <math>k</math> (the number of clusters) will return different clustering results. Therefore, it is important to define the number of clusters before clustering. For example, in this paper, the authors choose to use the Elbow scheme to determine the value of <math>k</math>. The Elbow scheme is a somewhat subjective way of choosing an optimal <math>k</math> that involves plotting the average of the squared distances from the cluster centers of the respective clusters (distortion) as a function of <math>k</math> and choosing a <math>k</math> at which point the decrease in distortion is outweighed by the increase in complexity. Also, to measure the performance of clustering, the authors decide to use the Silhouette scheme. The results of clustering are validated if the Silhouette scheme returns a value greater than <math>0.5</math>.<br />
<br />
=System Testing Results=<br />
<br />
In this paper, the dataset has 3264 research papers from the Future Generation Computer System (FGCS) journal between 1984 and 2017. For constructing keyword dictionaries for each paper, the authors have introduced three methods as shown below:<br />
<br />
<div align="center">[[File:table_3_tmtckd.JPG|700px]]</div><br />
<br />
<br />
Then, the authors use the Elbow scheme to define the number of clusters for each method with different numbers of keywords before running the K-means clustering algorithm. The results are shown below:<br />
<br />
<div align="center">[[File:table_4_nocobes.JPG|700px]]</div><br />
<br />
According to Table 4, there is a positive correlation between the number of keywords and the number of clusters. In addition, method 3 combines the advantages for both method 1 and method 2; thus, method 3 requires the least clusters in total. On the other hand, the wrong keywords might be presented in papers; hence, it might not be possible to group papers with similar subjects correctly by using method 1 and so method 1 needs the most number of clusters in total.<br />
<br />
<br />
Next, the Silhouette scheme had been used for measuring the performance for clustering. The average of the Silhouette values for each method with different numbers of keywords are shown below:<br />
<br />
<div align="center">[[File:table_5_asv.JPG|700px]]</div><br />
<br />
Since the clustering is validated if the Silhouette’s value is greater than 0.5, for methods with 10 and 30 keywords, the K-means clustering algorithm produces good results.<br />
<br />
<br />
To evaluate the accuracy of the classification system in this paper, the authors use the F-Score. The authors execute 5 times of experiment and use 500 randomly selected research papers for each trial. The following histogram shows the average value of F-Score for the three methods and different numbers of keywords:<br />
<br />
<div align="center">[[File:fig_16_fsvotm.JPG|700px]]</div><br />
<br />
Note that “TFIDF” means method 1, “LDA” means method 2, and “TFIDF-LDA” means method 3. The number 10, 20, and 30 after each method is the number of keywords the method has used.<br />
According to the histogram above, method 3 has the highest F-Score values than the other two methods with different numbers of keywords. Therefore, the classification system is most accurate when using method 3 as it combines the advantages for both method 1 and method 2.<br />
<br />
=Conclusion=<br />
<br />
This paper introduces a classification system that classifies research papers into different topics by using TF-IDF and LDA scheme with K-means clustering algorithm. This system allows users to search the papers they want quickly and with the most productivity.<br />
<br />
Furthermore, this classification system might be also used in different types of texts (e.g. documents, tweets, etc.) instead of only classifying research papers.<br />
<br />
=Critique=<br />
<br />
In this paper, DF values are calculated within each partition. This results that for each partition, DF value for a given word will vary and may have an inconsistent result for different partition methods. As mentioned above, there might be a divide by zero problem since some partitions do not have documents containing a given word, but this can be solved by introducing a dummy document as the authors did. Another method that might be better at solving inconsistent results and the divide by zero problems is to have all partitions to communicate with their DF value. Then pass the merged DF value to all partitions to do the final IDF and TF-IDF value. Having all partitions to communicate with the DF value will guarantee a consistent DF value across all partitions and helps avoid a divide by zero problem as words in the keyword dictionary must appear in some documents in the whole collection.<br />
<br />
This paper treated the words in the different parts of a document equivalently, it might perform better if it gives different weights to the same word in different parts. For example, if a word appears in the title of the document, it usually shows it's a main topic of this document so we can put more weight on it to categorize.<br />
<br />
When discussing the potential processing advantages of this classification system for other types of text samples, has the effect of processing mixed samples (text and image or text and video) taken into consideration? IF not, in terms of text classification only, does it have an overwhelming advantage over traditional classification models?<br />
<br />
The preprocessing should also include <math>n</math>-gram tokenization for topic modelling because some topics are inherently two words, such as machine learning where if it is seen separately, it implies different topics.<br />
<br />
This system is very compute-intensive due to the large volumes of dictionaries that can be generated by processing large volumes of data. It would be nice to see how much data HDFS had to process and similarly how much time was saved by using Hadoop for data processing as opposed to centralized approach.<br />
<br />
This system can be improved further in terms of computation times by utilizing other big data framework MapReduce, that can also use HDFS, by parallelizing their computation across multiple nodes for K-means clustering as discussed in (Jin, et al) [5].<br />
<br />
It's not exactly clear what method 3 (TFIDF-LDA) is doing, how is it performing TF-IDF on the topics? Also it seems like the preprocessing step only keeps 10/20/30 top words? This seems like an extremely low number especially in comparison with the LDA which has 10/20/30 topics - what is the reason for so strongly limiting the number of words? It would also be interesting to see if both key words and topics are necessary - an ablation study showing the significance of both would be interesting.<br />
<br />
=References=<br />
<br />
Blei DM, el. (2003). Latent Dirichlet allocation. J Mach Learn Res 3:993–1022<br />
<br />
Gil, JM, Kim, SW. (2019). Research paper classification systems based on TF-IDF and LDA schemes. ''Human-centric Computing and Information Sciences'', 9, 30. https://doi.org/10.1186/s13673-019-0192-7<br />
<br />
Liu, S. (2019, January 11). Dirichlet distribution Motivating LDA. Retrieved November 2020, from https://towardsdatascience.com/dirichlet-distribution-a82ab942a879<br />
<br />
Serrano, L. (Director). (2020, March 18). Latent Dirichlet Allocation (Part 1 of 2) [Video file]. Retrieved 2020, from https://www.youtube.com/watch?v=T05t-SqKArY<br />
<br />
Jin, Cui, Yu. (2016). A New Parallelization Method for K-means. https://arxiv.org/ftp/arxiv/papers/1608/1608.06347.pdf</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=48932Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-12-02T21:59:53Z<p>T34sun: /* Introduction */</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach on detecting heart disease from ECG signals by fine-tuning the deep learning neural network, ConvNetQuake. For context, ConvNetQuake is a convolutional neural network, used by Perol, Gharbi, and Denolle [4], for Earthquake detection and location from a single waveform. A deep learning approach was used due to the model's ability to be trained using multiple GPUs and terabyte-sized datasets. This, in turn, creates a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads on identifying heart disease, to show the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques exhibits great promise to healthcare innovation, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
The database used in previous works is the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records. Previous papers used techniques, such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. From these instances, the paper observes several faults in the previous papers. The first being the issue that most papers use feature selection on the raw ECG data before training the model. Dabanloo and Attarodi [2] used various techniques such as ANN, K-nearest neighbors, and Naïve Bayes. However, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [3] used SVM and K-nearest neighbors as their classifier, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second issue is that papers that do not use feature selection would arbitrarily pick ECG leads for classification without rationale. For example, Liu et al. [1] used a deep CNN that uses 3 seconds of ECG signal from lead II at a time as input. The decision for using lead II compared to the other leads was not explained. <br />
<br />
The issue with feature selection is that it can be time-consuming and impractical with large volumes of data. The second issue with the arbitrary selection of leads is that it does not offer insight into why the lead was chosen and the contributions of each lead in the identification of heart disease. Thus, this paper addresses these two issues through implementing a deep learning model that does not rely on feature selection of ECG data and to quantify the contributions of each ECG and Frank lead in identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate, and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers; this addition helps to combat overfitting. A second modification that was made was to introduce the use of label smoothing, which can help by discouraging the model from making overconfident predictions. Label smoothing refers to the method of relaxing the confidence on the model's prediction labels. The authors' experiments demonstrated that both of these modifications helped to increase model accuracy. <br />
<br />
During the training stage, a 10-second long two-channel input was fed into the neural network. In order to ensure that the two channels were weighted equally, both channels were normalized. Besides, time invariance was incorporated by selecting the 10-second long segment randomly from the entire signal. <br />
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLu activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the Sigmoid activation function.<br />
<br />
This model is trained by using batches of size 10. The learning rate is 10^-4. The ADAM optimizer is used. In training the model, the dataset is split into a train set, validation set, and test set with ratios 80-10-10.<br />
<br />
During the training process, the model was trained from scratch numerous times to avoid inserting unintended variation into the model by randomly initializing weights.<br />
<br />
The following images gives a visual representation of the model.<br />
<br />
[[File:architecture.png | thumb | center | 1000px | Model Architecture (Gupta et al., 2019)]]<br />
<br />
==Results== <br />
<br />
The paper first uses quantification of accuracies for single channels with 20-fold cross-validation, resulting in the highest individual accuracies: v5, v6, vx, vz, and ii. The researchers further investigated the accuracies for pairs of the top 5 highest individual channels using 20-fold cross-validation. They arrived at the conclusion of highest pairs accuracies to fed into a neural network is lead v6 and lead vz. They then use 100-fold cross validation on v6 and vz pair of channels, then compare outliers based on top 20, top 50 and total 100 performing models, finding that standard deviation is non-trivial and there are few models performed very poorly. <br />
<br />
Next, they discussed 2 factors affecting model performance evaluation: 1) Random train-val-test split might have effects on the performance of the model, but it can be improved by access with a larger data set and further discussion; and 2) random initialization of the weights of the neural network shows little effects on the performance of the model performance evaluation, because of showing high average results with a fixed train-val-test split. <br />
<br />
Comparing with other models in the other 12 papers, the model in this article has the highest accuracy, specificity, and precision. With concerns of patients' records affecting the training accuracy, they used 290 fold patient-wise split, resulting in the same highest accuracy of the pair v6 and vz same as record-wise split. The second best pair was ii and vz, which also contains the vz channel. Combining the two best pair channels into v6, vz, vii ultimately gave the best results over 10 trials which has an average of 97.83% in patient-wise split. Even though the patient-wise split might result in lower accuracy evaluation, however, it still maintains a very high average.<br />
<br />
==Conclusion & Discussion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-art model by a large margin of 1 percent. This study finds that out of the 15 ECG channels(12 conventional ECG leads and 3 Frank Leads), channel v6, vz, and ii contain the most meaningful information for detecting myocardial infraction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infraction with a cardiologist-level success rate. To further improve the performance of the models, access to a larger labeled data set is needed. The PTB database is small. It is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infraction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the 15 ECG channels indicates that deep learning can identify myocardial infraction by processing only ten seconds of raw ECG data from the v6, vz and ii leads and reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infraction. Feature engineering of ECG data is not required to identify myocardial infraction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
Fourier Transform(such as FFT) can be helpful when dealing with ECG signals. It transforms signals from time domain to frequency domain, which means some hidden features in frequency may be discovered.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is often a common problem in most applied deep learning studies. Since the PTB database is as small as you describe it to be, the robustness of the model which may be hard to gauge. There are very likely various other physical factors that may play a role in the study which the deep neural network may not be able to adjust for as well, since health data can be somewhat subjective at times and/or may be somewhat inaccurate, especially if machines are used to measurement. This might mean error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.<br />
<br />
- I feel that the results of deep learning models in medical settings where the consequences of misclassification can be severe should be evaluated by assigning weights to classification. In case if the misclassification can lead to severe consequences, then the network should be trained in such a way that it errs towards safety. For example, in case if heart disease, the consequences will be very high if the system says that there is no heart disease when in fact there is. So, the evaluation metric must be selected carefully.<br />
<br />
- This is a useful and meaningful application topic in machine learning. Using Deep Learning to detect heart disease can be very helpful if it is difficult to detect disease by looking at ECG by humans eys. This model also useful for doing statistics, such as calculating the percentage of people get heart disease. But I think the doctor should not 100% trust the result from the model, it is almost impossible to get 100% accuracy from a model. So, I think double-checking by human eyes is necessary if the result is weird. What is more, I think it will be interesting to discuss more applications in mediccal by using this method, such as detecting the Brainwave diagram to predict a person's mood and to diagnose mental diseases.<br />
<br />
- Compared to the dataset for other topics such as object recognition, the PTB database is pretty small with only 549 ECG records. And these are highly unbiased (Table 1) with 4 records for myocarditis and 148 for myocardial infarction. Medical datasets can only be labeled by specialists. This is why these datasets are related small. It would be great if there will be a larger, more comprehensive dataset.<br />
<br />
- Only results using 20-fold cross validation were presented. It should be shown that the results could be reproduced using a more common number of folds like 5 or 10<br />
<br />
- There are potential issues with the inclusion of Frank leads. From a practitioner standpoint, ECGs taken with Frank leads are less common. This could prevent the use of this technique. Additionally, Frank leads are expressible as a linear combinations of the 12 traditional leads. The authors are not adding any fundamentally new information by including them and their inclusion could be viewed as a form of feature selection (going against the authors' original intentions).<br />
<br />
- It will better if we can see how the model in this paper outperformed those methods that used feature selections. The details of the results are not enough.<br />
<br />
- A new extended dataset for PTB dubbed [https://www.nature.com/articles/s41597-020-0495-6 PTB-XL], has 21837 records. Using this dataset could yield a more accurate result, since the original PTB's small dataset posed limitations on the deep learning model.<br />
<br />
- The paper mentions that it has better results, but by how much? what accuracy did the methods you compared to have? Also, what methods did you compare to? (Authors mentioned feature engineering methods but this is vague) Also how much were the labels smoothed? (i.e. 1 -> 0.99 or 1-> 0.95 for example) How much of a difference did the label smoothing make?<br />
<br />
== References ==<br />
<br />
[1] Na Liu et al. "A Simple and Effective Method for Detecting Myocardial Infarction Based on Deep Convolutional Neural Network". In: Journal of Medical Imaging and Health Informatics (Sept. 2018). doi: 10.1166/jmihi.2018.2463.<br />
<br />
[2] Naser Safdarian, N.J. Dabanloo, and Gholamreza Attarodi. "A New Pattern Recognition Method for Detection and Localization of Myocardial Infarction Using T-Wave Integral and Total Integral as Extracted Features from One Cycle of ECG Signal". In: J. Biomedical Science and Engineering (Aug. 2014). doi: http://dx.doi.org/10.4236/jbise.2014.710081.<br />
<br />
[3] L.D. Sharma and R.K. Sunkaria. "Inferior myocardial infarction detection using stationary wavelet transform and machine learning approach." In: Signal, Image and Video Processing (July 2017). doi: https://doi.org/10.1007/s11760-017-1146-z.<br />
<br />
[4] Perol Thibaut, Gharbi Michaël, and Denolle Marin. "Convolutional neural network for earthquake detection and location". In: Science Advances (Feb. 2018). doi: 10.1126/sciadv.1700578</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Loss_Function_Search_for_Face_Recognition&diff=48931Loss Function Search for Face Recognition2020-12-02T21:58:57Z<p>T34sun: /* Optimization */</p>
<hr />
<div>== Presented by ==<br />
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang<br />
<br />
== Introduction ==<br />
Face recognition is a technology that can label a face to a specific identity. The process involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face and another face map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness.<br />
<br />
Hence, the paper introduced a new loss function which can reduce the softmax probability. Softmax probability is the probability for each class. It contains a vector of values that add up to 1 while ranging between 0 and 1. Cross-entropy loss is the negative log of the probabilities. When softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:<br />
<br />
<center><math>L_1=-log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k≠y}^K{e^{w^T_yx}}}</math> [1] </center><br />
<br />
Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:<br />
<br />
<center><math>L_2=-log\frac{e^{s cos{(\theta_{{w_y},x})}}}{e^{s cos{(\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s cos{(\theta_{{w_y},x})}}}}</math> [1] </center><br />
<br />
This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above. Some of these variations will be discussed in detail in the later sections. <br />
<br />
In this paper, the authors first identified that reducing the softmax probability is a key contribution to feature discrimination and designed two design search spaces (random and reward-guided method). They then evaluated their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms using nine popular face recognition benchmarks.<br />
<br />
== Previous Work ==<br />
Margin-based (angular, additive, additive angular margins) soft-max loss functions are important in learning discriminative features in face recognition. There have been hand-crafted methods previously developed that require much efforts such as A-softmax, V-softmax, AM-Softmax, and Arc-softmax. Li et al. proposed an AutoML for loss function search method also known as AM-LFS from a hyper-parameter optimization perspective [2]. It automatically determines the search space by leveraging reinforcement learning to the search loss functions during the training process, though the drawback is the complex and unstable search space.<br />
<br />
== Motivation ==<br />
Previous algorithms for facial recognition frequently rely on CNNs that may include metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost for these functions was high. This drawback prompts the redesign of classical softmax loss that cannot discriminate features. Multiple softmax loss functions have since been developed, and including margin-based formulations, they often require fine-tuning of parameters and are susceptible to instability. Therefore, researchers need to put in a lot of effort in creating their method in the large design space. AM-LFS takes an optimization approach for selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.<br />
<br />
To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one parameter required and an improved search space using a reward-based method which allows the authors to determine the best option for their loss function.<br />
<br />
== Problem Formulation ==<br />
=== Analysis of Margin-based Softmax Loss ===<br />
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:<br />
<br />
<center><math>p_m=\frac{1}{ap+(1-a)}*p</math></center><br />
<center> where <math>a=1-e^{s{cos{(\theta_{w_y},x)}-f{(m,\theta_{w_y},x)}}}</math> and <math>a≤0</math></center><br />
<br />
<math>a</math> is considered as a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the margin function (<math>f</math>), the minimization of the softmax probability will ensure success.<br />
<br />
Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, versus AM-LFS which has 2M parameters without constraints that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because it could be larger than the softmax probability.<br />
<br />
=== Random Search ===<br />
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:<br />
<br />
<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center><br />
<br />
This encourages the feature margin between different classes and has the capability of feature discrimination. This leads to defining the search space as the choice of <math>h{(a,p)}</math> whose impacts on the training procedure are decided by the modulating factor <math>a</math>. In order to validate the unified formulation, a modulating factor is randomly set at each training epoch. This is noted as Random-Softmax in this paper.<br />
<br />
=== Reward-Guided Search ===<br />
Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need to have input/output labelled and it does not need a sub-optimal action to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking an action that maximizes the notion of cumulative reward [3]. The process of RL is shown in figure 1. The equation of the cumulative reward function is: <br />
<br />
<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center><br />
<br />
where <math>G_t</math> = cumulative reward, <math>R_t</math> = immediate reward, and <math>R_T</math> = end of episode.<br />
<br />
<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate reward which depends on the agent action and the environment's reaction to this action.<br />
<br />
<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center><br />
<center>Figure 1: Reinforcement Learning scenario [4]</center><br />
<br />
The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5]. <br />
<br />
In this paper, RL is being used to generate a distribution of the hyperparameter <math>\mu</math> for the SoftMax equation using the reward function. At each epoch, <math>B</math> hyper-parameters <math>{a_1, a_2, ..., a_B }</math> are sampled as <math>a \sim \mathcal{N}(\mu, \sigma)</math>. In each epoch, <math>B</math> models are generated with rewards <math>R(a_i), i \in [1, B]</math>. <math>\mu</math> updates after each epoch from the reward function. <br />
<br />
<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center><br />
<br />
=== Optimization ===<br />
Calculating the reward involves a standard bi-level optimization problem. A standard bi-level optimization problem is a hierarchy of two optimization tasks, an upper-level or leader and lower-level or follower problems, which involves a hyperparameter ({<math>a_1,a_2,…,a_B</math>}) that can be used for minimizing one objective function while maximizing another objective function simultaneously:<br />
<br />
<center><math>max_a R(a)=r(M_{w^*(a)},S_v)</math></center><br />
<center><math>w^*(a)=_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center><br />
<br />
In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The calculated reward for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) yields the corresponding score, then the algorithm chooses the one with the highest score for model index selection. With the model containing the highest score being used in the next epoch, this process is repeated until the training reaches convergence. In the end, the algorithm takes the model with the highest score without retraining.<br />
<br />
== Results and Discussion ==<br />
=== Data Preprocessing ===<br />
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets. Furthermore, there were a total of 15,414 identities that overlapped between the testing and training datasets. These were removed from the training sets.<br />
<br />
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===<br />
For LFW, there is not a noticeable difference between the algorithms proposed in this paper and the other algorithms, however, AM-Softmax achieved higher results than Search-Softmax. Random-Softmax achieved the highest results by 0.03%.<br />
<br />
Random-Softmax outperforms baseline Soft-max and is comparable to most of the margin-based softmax. Search-Softmax boost the performance and better most methods specifically when training CASIA-WebFace-R data set, it achieves 0.72% average improvement over AM-Softmax. The reason the model proposed by the paper gives better results is because of their optimization strategy which helps boost the discimination power. Also the sampled candidate from the paper’s proposed search space can well approximate the margin-based loss functions. More tests need to happen to more complicated protocols to test the performance further. Not a lot of improvement has been shown on those test sets, since they are relatively simple and the performance of all the methods on these test sets are near saturation. The following table gives a summary of the performance of each model.<br />
<br />
<center>Table 1.Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<br />
<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on RFW ===<br />
The RFW dataset measures racial bias which consists of Caucasian, Indian, Asian, and African. Using this as the test set, Random-softmax and Search-softmax performed better than the other methods. Random-softmax outperforms the baseline softmax by a large margin which means reducing the softmax probability will enhance the feature discrimination for face recognition. It is also observed that the reward guided search-softmax method is more likely to enhance the discriminative feature learning resulting in higher performance as shown in Table 2 and Table 3. <br />
<br />
<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on MegaFace and Trillion-Pairs ===<br />
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 and the verification (Veri.) with the true positive rate (TPR) at low false acceptance rate (FAR) at <math>1e-3</math> on MegaFace, the identification TPR@FAR = <math>1e-6</math> and the verification TPR@FAR = <math>1e-9</math> on Trillion-Pairs are reported on Table 4 and 5.<br />
<br />
On the test sets MegaFace and Trillion-Pairs, Search-softmax achieves the best performance over all other alternative methods. On MegaFace, Search-softmax beat the best competitor AM-softmax by a large margin. It also outperformed AM-LFS due to new designed search space. <br />
<br />
<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center><br />
<br />
From the CMC curves and ROC curves in Figure 2, similar trends are observed at other measures. There is a same trend on Trillion-Pairs where Search-softmax loss is found to be superior with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R at both the identification and verification. Based on these experiments, Search-Softmax loss can perform well, especially with a low false positive rate and it shows a strong generalization ability for face recognition.<br />
<br />
<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center><br />
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center><br />
<br />
== Conclusion ==<br />
In this paper, it is discussed that in order to enhance feature discrimination for face recognition, it is key to know how to reduce the softmax probability. To achieve this goal, unified formulation for the margin-based softmax losses is designed. Two search methods have been developed using a random and a reward-guided loss function and they were validated to be effective over six other methods using nine different test data sets. While these developed methods were generally more effective in increasing accuracy versus previous methods, there is very little difference between the two. It can be seen that Search-Softmax performs slightly better than Random-Softmax most of the time.<br />
<br />
== Critiques ==<br />
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.<br />
* Datasets used did require some preprocessing, which may have improved the results beyond what the method otherwise would.<br />
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.<br />
* The test data set they used to test Search-Softmax and Random-Softmax are simple and they saturate in other methods. So the results of their methods didn’t show many advantages since they produce very similar results. A more complicated data set needs to be tested to prove the method's reliability.<br />
* There is another paper Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf] that provides a more detailed explanation about how to reduce margin-based softmax loss.<br />
* It is questionable when it comes to the accuracy of testing sets, as they only used the clean version of CASIA-WebFace and MS-Celeb-1M-vlc for training instead of these two training sets with noisy labels.<br />
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], written by Tae-Hyun Oh et al., they also discuss an optimal loss function for face recognition. However, since in the other paper, they were doing face recognition from voice audio, the loss function used was slightly different than the ones discussed in this paper.<br />
* This model has many applications such as identifying disguised prisoners for police. But we need to do a good data preprocessing otherwise we might not get a good predicted result. But authors did not mention about the data preprocessing which is a key part of this model.<br />
* It will be better if we can know what kind of noises was removed in the clean version. Also, simply removing the overlapping data is wasteful. It would be better to just put them into one of the train and test samples.<br />
* This paper indicate that the new searching method and loss function have induced more effective face recognition result than other six methods. But there is no mention of the increase or decrease in computational efficiency since only very little difference exist between those methods and the real time evaluation is often required at the face recognition application level.<br />
* There are some loss functions that receives more than 2 inputs. For example, the ''triplet loss'' function, developed by Google, takes 3 inputs: positive input, negative input and anchor input. This makes sense because for face recognition, we want to model to learn not only what it is supposed to predict but also what it is not supposed to predict. Typically, triplet loss handles false positives much better. This paper can extend its scope to such loss function that takes more than 2 inputs.<br />
* It would be good to also know what the training time is like for the method, specifically the "Reward-Guided Search" which uses RL. Also the authors mention some data preprocessing that was performed, was this same preprocessing also performed for the methods they compared against?<br />
<br />
== References ==<br />
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.<br />
<br />
[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.<br />
2020].<br />
<br />
[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020]. <br />
<br />
[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].<br />
<br />
[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Loss_Function_Search_for_Face_Recognition&diff=48930Loss Function Search for Face Recognition2020-12-02T21:58:24Z<p>T34sun: /* Optimization */</p>
<hr />
<div>== Presented by ==<br />
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang<br />
<br />
== Introduction ==<br />
Face recognition is a technology that can label a face to a specific identity. The process involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face and another face map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness.<br />
<br />
Hence, the paper introduced a new loss function which can reduce the softmax probability. Softmax probability is the probability for each class. It contains a vector of values that add up to 1 while ranging between 0 and 1. Cross-entropy loss is the negative log of the probabilities. When softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:<br />
<br />
<center><math>L_1=-log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k≠y}^K{e^{w^T_yx}}}</math> [1] </center><br />
<br />
Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:<br />
<br />
<center><math>L_2=-log\frac{e^{s cos{(\theta_{{w_y},x})}}}{e^{s cos{(\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s cos{(\theta_{{w_y},x})}}}}</math> [1] </center><br />
<br />
This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above. Some of these variations will be discussed in detail in the later sections. <br />
<br />
In this paper, the authors first identified that reducing the softmax probability is a key contribution to feature discrimination and designed two design search spaces (random and reward-guided method). They then evaluated their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms using nine popular face recognition benchmarks.<br />
<br />
== Previous Work ==<br />
Margin-based (angular, additive, additive angular margins) soft-max loss functions are important in learning discriminative features in face recognition. There have been hand-crafted methods previously developed that require much efforts such as A-softmax, V-softmax, AM-Softmax, and Arc-softmax. Li et al. proposed an AutoML for loss function search method also known as AM-LFS from a hyper-parameter optimization perspective [2]. It automatically determines the search space by leveraging reinforcement learning to the search loss functions during the training process, though the drawback is the complex and unstable search space.<br />
<br />
== Motivation ==<br />
Previous algorithms for facial recognition frequently rely on CNNs that may include metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost for these functions was high. This drawback prompts the redesign of classical softmax loss that cannot discriminate features. Multiple softmax loss functions have since been developed, and including margin-based formulations, they often require fine-tuning of parameters and are susceptible to instability. Therefore, researchers need to put in a lot of effort in creating their method in the large design space. AM-LFS takes an optimization approach for selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.<br />
<br />
To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one parameter required and an improved search space using a reward-based method which allows the authors to determine the best option for their loss function.<br />
<br />
== Problem Formulation ==<br />
=== Analysis of Margin-based Softmax Loss ===<br />
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:<br />
<br />
<center><math>p_m=\frac{1}{ap+(1-a)}*p</math></center><br />
<center> where <math>a=1-e^{s{cos{(\theta_{w_y},x)}-f{(m,\theta_{w_y},x)}}}</math> and <math>a≤0</math></center><br />
<br />
<math>a</math> is considered as a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the margin function (<math>f</math>), the minimization of the softmax probability will ensure success.<br />
<br />
Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, versus AM-LFS which has 2M parameters without constraints that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because it could be larger than the softmax probability.<br />
<br />
=== Random Search ===<br />
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:<br />
<br />
<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center><br />
<br />
This encourages the feature margin between different classes and has the capability of feature discrimination. This leads to defining the search space as the choice of <math>h{(a,p)}</math> whose impacts on the training procedure are decided by the modulating factor <math>a</math>. In order to validate the unified formulation, a modulating factor is randomly set at each training epoch. This is noted as Random-Softmax in this paper.<br />
<br />
=== Reward-Guided Search ===<br />
Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need to have input/output labelled and it does not need a sub-optimal action to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking an action that maximizes the notion of cumulative reward [3]. The process of RL is shown in figure 1. The equation of the cumulative reward function is: <br />
<br />
<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center><br />
<br />
where <math>G_t</math> = cumulative reward, <math>R_t</math> = immediate reward, and <math>R_T</math> = end of episode.<br />
<br />
<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate reward which depends on the agent action and the environment's reaction to this action.<br />
<br />
<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center><br />
<center>Figure 1: Reinforcement Learning scenario [4]</center><br />
<br />
The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5]. <br />
<br />
In this paper, RL is being used to generate a distribution of the hyperparameter <math>\mu</math> for the SoftMax equation using the reward function. At each epoch, <math>B</math> hyper-parameters <math>{a_1, a_2, ..., a_B }</math> are sampled as <math>a \sim \mathcal{N}(\mu, \sigma)</math>. In each epoch, <math>B</math> models are generated with rewards <math>R(a_i), i \in [1, B]</math>. <math>\mu</math> updates after each epoch from the reward function. <br />
<br />
<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center><br />
<br />
=== Optimization ===<br />
Calculating the reward involves a standard bi-level optimization problem(a standard bi-level optimization problem is a hierarchy of two optimization tasks, an upper-level or leader and lower-level or follower problems), which involves a hyperparameter ({<math>a_1,a_2,…,a_B</math>}) that can be used for minimizing one objective function while maximizing another objective function simultaneously:<br />
<br />
<center><math>max_a R(a)=r(M_{w^*(a)},S_v)</math></center><br />
<center><math>w^*(a)=_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center><br />
<br />
In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The calculated reward for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) yields the corresponding score, then the algorithm chooses the one with the highest score for model index selection. With the model containing the highest score being used in the next epoch, this process is repeated until the training reaches convergence. In the end, the algorithm takes the model with the highest score without retraining.<br />
<br />
== Results and Discussion ==<br />
=== Data Preprocessing ===<br />
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets. Furthermore, there were a total of 15,414 identities that overlapped between the testing and training datasets. These were removed from the training sets.<br />
<br />
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===<br />
For LFW, there is not a noticeable difference between the algorithms proposed in this paper and the other algorithms, however, AM-Softmax achieved higher results than Search-Softmax. Random-Softmax achieved the highest results by 0.03%.<br />
<br />
Random-Softmax outperforms baseline Soft-max and is comparable to most of the margin-based softmax. Search-Softmax boost the performance and better most methods specifically when training CASIA-WebFace-R data set, it achieves 0.72% average improvement over AM-Softmax. The reason the model proposed by the paper gives better results is because of their optimization strategy which helps boost the discimination power. Also the sampled candidate from the paper’s proposed search space can well approximate the margin-based loss functions. More tests need to happen to more complicated protocols to test the performance further. Not a lot of improvement has been shown on those test sets, since they are relatively simple and the performance of all the methods on these test sets are near saturation. The following table gives a summary of the performance of each model.<br />
<br />
<center>Table 1.Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<br />
<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on RFW ===<br />
The RFW dataset measures racial bias which consists of Caucasian, Indian, Asian, and African. Using this as the test set, Random-softmax and Search-softmax performed better than the other methods. Random-softmax outperforms the baseline softmax by a large margin which means reducing the softmax probability will enhance the feature discrimination for face recognition. It is also observed that the reward guided search-softmax method is more likely to enhance the discriminative feature learning resulting in higher performance as shown in Table 2 and Table 3. <br />
<br />
<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on MegaFace and Trillion-Pairs ===<br />
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 and the verification (Veri.) with the true positive rate (TPR) at low false acceptance rate (FAR) at <math>1e-3</math> on MegaFace, the identification TPR@FAR = <math>1e-6</math> and the verification TPR@FAR = <math>1e-9</math> on Trillion-Pairs are reported on Table 4 and 5.<br />
<br />
On the test sets MegaFace and Trillion-Pairs, Search-softmax achieves the best performance over all other alternative methods. On MegaFace, Search-softmax beat the best competitor AM-softmax by a large margin. It also outperformed AM-LFS due to new designed search space. <br />
<br />
<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center><br />
<br />
From the CMC curves and ROC curves in Figure 2, similar trends are observed at other measures. There is a same trend on Trillion-Pairs where Search-softmax loss is found to be superior with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R at both the identification and verification. Based on these experiments, Search-Softmax loss can perform well, especially with a low false positive rate and it shows a strong generalization ability for face recognition.<br />
<br />
<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center><br />
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center><br />
<br />
== Conclusion ==<br />
In this paper, it is discussed that in order to enhance feature discrimination for face recognition, it is key to know how to reduce the softmax probability. To achieve this goal, unified formulation for the margin-based softmax losses is designed. Two search methods have been developed using a random and a reward-guided loss function and they were validated to be effective over six other methods using nine different test data sets. While these developed methods were generally more effective in increasing accuracy versus previous methods, there is very little difference between the two. It can be seen that Search-Softmax performs slightly better than Random-Softmax most of the time.<br />
<br />
== Critiques ==<br />
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.<br />
* Datasets used did require some preprocessing, which may have improved the results beyond what the method otherwise would.<br />
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.<br />
* The test data set they used to test Search-Softmax and Random-Softmax are simple and they saturate in other methods. So the results of their methods didn’t show many advantages since they produce very similar results. A more complicated data set needs to be tested to prove the method's reliability.<br />
* There is another paper Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf] that provides a more detailed explanation about how to reduce margin-based softmax loss.<br />
* It is questionable when it comes to the accuracy of testing sets, as they only used the clean version of CASIA-WebFace and MS-Celeb-1M-vlc for training instead of these two training sets with noisy labels.<br />
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], written by Tae-Hyun Oh et al., they also discuss an optimal loss function for face recognition. However, since in the other paper, they were doing face recognition from voice audio, the loss function used was slightly different than the ones discussed in this paper.<br />
* This model has many applications such as identifying disguised prisoners for police. But we need to do a good data preprocessing otherwise we might not get a good predicted result. But authors did not mention about the data preprocessing which is a key part of this model.<br />
* It will be better if we can know what kind of noises was removed in the clean version. Also, simply removing the overlapping data is wasteful. It would be better to just put them into one of the train and test samples.<br />
* This paper indicate that the new searching method and loss function have induced more effective face recognition result than other six methods. But there is no mention of the increase or decrease in computational efficiency since only very little difference exist between those methods and the real time evaluation is often required at the face recognition application level.<br />
* There are some loss functions that receives more than 2 inputs. For example, the ''triplet loss'' function, developed by Google, takes 3 inputs: positive input, negative input and anchor input. This makes sense because for face recognition, we want to model to learn not only what it is supposed to predict but also what it is not supposed to predict. Typically, triplet loss handles false positives much better. This paper can extend its scope to such loss function that takes more than 2 inputs.<br />
* It would be good to also know what the training time is like for the method, specifically the "Reward-Guided Search" which uses RL. Also the authors mention some data preprocessing that was performed, was this same preprocessing also performed for the methods they compared against?<br />
<br />
== References ==<br />
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.<br />
<br />
[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.<br />
2020].<br />
<br />
[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020]. <br />
<br />
[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].<br />
<br />
[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_Probability&diff=48927Being Bayesian about Categorical Probability2020-12-02T21:54:42Z<p>T34sun: /* Related Work */</p>
<hr />
<div>== Presented By ==<br />
Evan Li, Jason Pu, Karam Abuaisha, Nicholas Vadivelu<br />
<br />
== Introduction ==<br />
<br />
Since the outputs of neural networks are not probabilities, Softmax (Bridle, 1990) is a staple for neural network’s performing classification--it exponentiates each logit then normalizes by the sum, giving a distribution over the target classes. However, networks with softmax outputs give no information about uncertainty (Blundell et al., 2015; Gal & Ghahramani, 2016), and the resulting distribution over classes is poorly calibrated (Guo et al., 2017), often giving overconfident predictions even when the classification is wrong. In addition, softmax also raises concerns about overfitting NNs due to its confident predictive behaviors(Xie et al., 2016; Pereyra et al., 2017). To achieve better generalization performance, this may require some effective regularization techniques. <br />
<br />
Bayesian Neural Networks (BNNs; MacKay, 1992) can alleviate these issues, but the resulting posteriors over the parameters are often intractable. Approximations such as variational inference (Graves, 2011; Blundell et al., 2015) and Monte Carlo Dropout (Gal & Ghahramani, 2016) can still be expensive or give poor estimates for the posteriors. This work proposes a Bayesian treatment of the output logits of the neural network, treating the targets as a categorical random variable instead of a fixed label. This gives a computationally cheap way to get well-calibrated uncertainty estimates on neural network classifications.<br />
<br />
== Related Work ==<br />
<br />
Using Bayesian Neural Networks is the dominant way of applying Bayesian techniques to neural networks. Many techniques have been developed to make posterior approximation more accurate and scalable, despite these, BNNs do not scale to the state of the art techniques or large data sets. There are techniques to explicitly avoid modeling the full weight posterior that are more scalable, such as with Monte Carlo Dropout (Gal & Ghahramani, 2016) or tracking mean/covariance of the posterior during training (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019). Non-Bayesian uncertainty estimation techniques such as deep ensembles (Lakshminarayanan et al., 2017) and temperature scaling (Guo et al., 2017; Neumann et al., 2018).<br />
<br />
== Preliminaries ==<br />
=== Definitions ===<br />
Let's formalize our classification problem and define some notations for the rest of this summary:<br />
<br />
::Dataset:<br />
$$ \mathcal D = \{(x_i,y_i)\} \in (\mathcal X \times \mathcal Y)^N $$<br />
::General classification model<br />
$$ f^W: \mathcal X \to \mathbb R^K $$<br />
::Softmax function: <br />
$$ \phi(x): \mathbb R^K \to [0,1]^K \;\;|\;\; \phi_k(X) = \frac{\exp(f_k^W(x))}{\sum_{k \in K} \exp(f_k^W(x))} $$<br />
::Softmax activated NN:<br />
$$ \phi \;\circ\; f^W: \chi \to \Delta^{K-1} $$<br />
::NN as a true classifier:<br />
$$ arg\max_i \;\circ\; \phi_i \;\circ\; f^W \;:\; \mathcal X \to \mathcal Y $$<br />
<br />
We'll also define the '''count function''' - a <math>K</math>-vector valued function that outputs the occurences of each class coincident with <math>x</math>:<br />
$$ c^{\mathcal D}(x) = \sum_{(x',y') \in \mathcal D} \mathbb y' I(x' = x) $$<br />
<br />
=== Classification With a Neural Network ===<br />
A typical loss function used in classification is cross-entropy. It's well known that optimizing <math>f^W</math> for <math>l_{CE}</math> is equivalent to optimizing for <math>l_{KL}</math>, the <math>KL</math> divergence between the true distribution and the distribution modeled by NN, that is:<br />
$$ l_{KL}(W) = KL(\text{true distribution} \;|\; \text{distribution encoded by }NN(W)) $$<br />
Let's introduce notations for the underlying (true) distributions of our problem. Let <math>(x_0,y_0) \sim (\mathcal X \times \mathcal Y)</math>:<br />
$$ \text{Full Distribution} = F(x,y) = P(x_0 = x,y_0 = y) $$<br />
$$ \text{Marginal Distribution} = P(x) = F(x_0 = x) $$<br />
$$ \text{Point Class Distribution} = P(y_0 = y \;|\; x_0 = x) = F_x(y) $$<br />
Then we have the following factorization:<br />
$$ F(x,y) = P(x,y) = P(y|x)P(x) = F_x(y)F(x) $$<br />
Substitute this into the definition of KL divergence:<br />
$$ = \sum_{(x,y) \in \mathcal X \times \mathcal Y} F(x,y) \log\left(\frac{F(x,y)}{\phi_y(f^W(x))}\right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F(y|x) \log\left( \frac{F(y|x)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F_x(y) \log\left( \frac{F_x(y)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) KL(F_x \;||\; \phi\left( f^W(x) \right)) $$<br />
As usual, we don't have an analytic form for <math>l</math> (if we did, this would imply we know <math>F_X</math> meaning we knew the distribution in the first place). Instead, estimate from <math>\mathcal D</math>:<br />
$$ F(x) \approx \hat F(x) = \frac{||c^{\mathcal D}(x)||_1}{N} $$<br />
$$ F_x(y) \approx \hat F_x(y) = \frac{c^{\mathcal D}(x)}{|| c^{\mathcal D}(x) ||_1}$$<br />
$$ \to l_{KL}(W) = \sum_{x \in \mathcal D} \frac{||c^{\mathcal D}(x)||_1}{N} KL \left( \frac{c^{\mathcal D}(x)}{||c^{\mathcal D}(x)||_1} \;||\; \phi(f^W(x)) \right) $$<br />
The approximations <math>\hat F, \hat F_X</math> are often not very good though: consider a typical classification such as MNIST, we would never expect two handwritten digits to produce the exact same image. Hence <math>c^{\mathcal D}(x)</math> is (almost) always going to have a single index 1 and the rest 0. This has implications for our approximations:<br />
$$ \hat F(x) \text{ is uniform for all } x \in \mathcal D $$<br />
$$ \hat F_x(y) \text{ is degenerate for all } x \in \mathcal D $$<br />
This clearly has implications for overfitting: to minimize the KL term in <math>l_{KL}(W)</math> we want <math>\phi(f^W(x))</math> to be very close to <math>\hat F_x(y)</math> at each point - this means that the loss function is in fact encouraging the neural network to output near degenerate distributions! One form of regularization to help this problem is called label smoothing. Instead of using the degenerate $F_x(y)$ as a target function, let's "smooth" it (by adding a scaled uniform distribution to it) so it's no longer degenerate:<br />
$$ F'_x(y) = (1-\lambda)\hat F_x(y) + \frac \lambda K \vec 1 $$<br />
<br />
== Method ==<br />
The main technical proposal of the paper is a Bayesian framework to estimate the (former) target distribution <math>F_x(y)</math>. That is, we construct a posterior distribution of <math> F_x(y) </math> and use that as our new target distribution. We call it the ''belief matching'' (BM) framework.<br />
<br />
=== Constructing Target Distribution ===<br />
Recall that <math>F_x(y)</math> is a k-categorical probability distribution - it's PMF can be fully characterized by k numbers that sum to 1. Hence we can encode any such <math>F_x</math> as a point in <math>\Delta^{k-1}</math>. We'll do exactly that - let's call this vecor <math>z</math>:<br />
$$ z \in \Delta^{k-1} $$<br />
$$ \text{prior} = p_{z|x}(z) $$<br />
$$ \text{conditional} = p_{y|z,x}(y) $$<br />
$$ \text{posterior} = p_{z|x,y}(z) $$<br />
Then if we perform inference:<br />
$$ p_{z|x,y}(z) \propto p_{z|x}(z)p_{y|z,x}(y) $$<br />
The distribution chosen to model prior was <math>dir_K(\beta)</math>:<br />
$$ p_{z|x}(z) = \frac{\Gamma(||\beta||_1)}{\prod_{k=1}^K \Gamma(\beta_k)} \prod_{k=1}^K z_k^{\beta_k - 1} $$<br />
Note that by definition of <math>z</math>: <math> p_{y|x,z} = z_y </math>. Since the Dirichlet is a conjugate prior to categorical distributions we have a convenient form for the mean of the posterior:<br />
$$ \bar{p_{z|x,y}}(z) = \frac{\beta + c^{\mathcal D}(x)}{||\beta + c^{\mathcal D}(x)||_1} \propto \beta + c^{\mathcal D}(x) $$<br />
This is in fact a generalization of (uniform) label smoothing (label smoothing is a special case where <math>\beta = \frac 1 K \vec{1} </math>).<br />
<br />
=== Representing Approximate Distribution ===<br />
Our new target distribution is <math>p_{z|x,y}(z)</math> (as opposed to <math>F_x(y)</math>). That is, we want to construct an interpretation of our neural network weights to construct a distribution with support in <math> \Delta^{K-1} </math> - the NN can then be trained so this encoded distribution closely approximates <math>p_{z|x,y}</math>. Let's denote the PMF of this encoded distribution <math>q_{z|x}^W</math>. This is how the BM framework defines it:<br />
$$ \alpha^W(x) := \exp(f^W(x)) $$<br />
$$ q_{z|x}^W(z) = \frac{\Gamma(||\alpha^W(x)||_1)}{\sum_{k=1}^K \Gamma(\alpha_k^W(x))} \prod_{k=1}^K z_{k}^{\alpha_k^W(x) - 1} $$<br />
$$ \to Z^W_x \sim dir(\alpha^W(x)) $$<br />
Apply <math>\log</math> then <math>\exp</math> to <math>q_{z|x}^W</math>:<br />
$$ q^W_{z|x}(z) \propto \exp \left( \sum_k (\alpha_k^W(x) \log(z_k)) - \sum_k \log(z_k) \right) $$<br />
$$ \propto -l_{CE}(\phi(f^W(x)),z) + \frac{K}{||\alpha^W(x)||}KL(\mathcal U_k \;||\; z) $$<br />
It can actually be shown that the mean of <math>Z_x^W</math> is identical to <math>\phi(f^W(x))</math> - in other words, if we output the mean of the encoded distribution of our neural network under the BM framework, it is theoretically identical to a traditional neural network.<br />
<br />
=== Distribution Matching ===<br />
<br />
We now need a way to fit our approximate distribution from our neural network <math>q_{\mathbf{z | x}}^{\mathbf{W}}</math> to our target distribution <math>p_{\mathbf{z|x},y}</math>. The authors achieve this by maximizing the evidence lower bound (ELBO):<br />
<br />
$$l_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) $$<br />
<br />
Each term can be computed analytically:<br />
<br />
$$\mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf W }} \left[\log z_y \right] = \psi(\alpha_y^{\mathbf W} ( \mathbf x )) - \psi(\alpha_0^{\mathbf W} ( \mathbf x )) $$<br />
<br />
Where <math>\psi(\cdot)</math> represents the digamma function (logarithmic derivative of gamma function). Intuitively, we maximize the probability of the correct label. For the KL term:<br />
<br />
$$KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) = \log \frac{\Gamma(a_0^{\mathbf W}(\mathbf x)) \prod_k \Gamma(\beta_k)}{\prod_k \Gamma(\alpha_k^{\mathbf W}(x)) \Gamma (\beta_0)} + \sum_k (\alpha_k^{\mathbf W}(x)-\beta_k)(\psi(\alpha_k^{\mathbf W}(\mathbf x)) - \psi(\alpha_0^{\mathbf W}(\mathbf x)) $$<br />
<br />
In the first term, for intuition, we can ignore <math>\alpha_0</math> and <math>\beta_0</math> since those just calibrate the distributions. Otherwise, we want the ratio of the products to be as close to 1 as possible to minimize the KL. In the second term, we want to minimize the difference between each individual <math>\alpha_k</math> and <math>\beta_k</math>, scaled by the normalized output of the neural network. <br />
<br />
This loss function can be used as a drop-in replacement for the standard softmax cross-entropy, as it has an analytic form and the same time complexity as typical softmax-cross entropy with respect to the number of classes (<math>O(K)</math>).<br />
<br />
=== On Prior Distributions ===<br />
<br />
We must choose our concentration parameter, <math>\beta</math>, for our dirichlet prior. We see our prior essentially disappears as <math>\beta_0 \to 0</math> and becomes stronger as <math>\beta_0 \to \infty</math>. Thus, we want a small <math>\beta_0</math> so the posterior isn't dominated by the prior. But, the authors claim that a small <math>\beta_0</math> makes <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small, which causes <math>\psi (\alpha_0^{\mathbf W}(\mathbf x))</math> to be large, which is problematic for gradient based optimization. In practice, many neural network techniques aim to make <math>\mathbb E [f^{\mathbf W} (\mathbf x)] \approx \mathbf 0</math> and thus <math>\mathbb E [\alpha^{\mathbf W} (\mathbf x)] \approx \mathbf 1</math>, which means making <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small can be counterproductive.<br />
<br />
So, the authors set <math>\beta = \mathbf 1</math> and introduce a new hyperparameter <math>\lambda</math> which is multiplied with the KL term in the ELBO:<br />
<br />
$$l^\lambda_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - \lambda KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; \mathcal P^D (\mathbf 1)) $$<br />
<br />
This stabilizes the optimization, as we can tell from the gradients:<br />
<br />
$$\frac{\partial l_{E B}\left(\mathbf{y}, \alpha^{\mathbf W}(\mathbf{x})\right)}{\partial \alpha_{k}^{\mathbf W}(\mathbf {x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\alpha_{k}^{\mathbf W}(\mathbf{x})-\beta_{k}\right)\right) \psi^{\prime}\left(\alpha_{k}^{\mathbf{W}}(\boldsymbol{x})\right)<br />
-\left(1-\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})-\beta_{0}\right)\right) \psi^{\prime}\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})\right)$$<br />
<br />
$$\frac{\partial l_{E B}^{\lambda}\left(\mathbf{y}, \alpha^{\mathbf{W}}(\mathbf{x})\right)}{\partial \alpha_{k}^{W}(\mathbf{x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})-\lambda\right)\right) \frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}<br />
-\left(1-\left(\tilde{\alpha}_{0}^{W}(\mathbf{x})-\lambda K\right)\right)$$<br />
<br />
As we can see, the first expression is affected by the magnitude of <math>\alpha^{\boldsymbol{W}}(\boldsymbol{x})</math>, whereas the second expression is not due to the <math>\frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}</math> ratio.<br />
<br />
== Experiments ==<br />
<br />
Throughout the experiments in this paper, the authors employ various models based on residual connections (He et al., 2016 [1]) which are the models used for benchmarking in practice. We will first demonstrate improvements provided by BM, then we will show versatility in other applications. For fairness of comparisons, all configurations in the reference implementation will be fixed. The only additions in the experiments are initial learning rate warm-up and gradient clipping which are extremely helpful for stable training of BM. <br />
<br />
=== Generalization performance === <br />
The paper compares the generalization performance of BM with softmax and MC dropout on CIFAR-10 and CIFAR-100 benchmarks.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T1.png]]<br />
<br />
The next comparison was performed between BM and softmax on the ImageNet benchmark. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T2.png]]<br />
<br />
For both datasets and In all configurations, BM achieves the best generalization and outperforms softmax and MC dropout.<br />
<br />
===== Regularization effect of prior =====<br />
<br />
In theory, BM has 2 regularization effects:<br />
The prior distribution, which smooths the target posterior<br />
Averaging all of the possible categorical probabilities to compute the distribution matching loss<br />
The authors perform an ablation study to examine the 2 effects separately - removing the KL term in the ELBO removes the effect of the prior distribution.<br />
For ResNet-50 on CIFAR-100 and CIFAR-10 the resulting test error rates were 24.69% and 5.68% respectively. <br />
<br />
This demonstrates that both regularization effects are significant since just having one of them improves the generalization performance compared to the softmax baseline, and having both improves the performance even more.<br />
<br />
===== Impact of <math>\beta</math> =====<br />
<br />
The effect of β on generalization performance is studied by training ResNet-18 on CIFAR-10 by tuning the value of β on its own, as well as jointly with λ. It was found that robust generalization performance is obtained for β ∈ [<math>e^{−1}, e^4</math>] when tuning β on its own; and β ∈ [<math>e^{−4}, e^{8}</math>] when tuning β jointly with λ. The figure below shows a plot of the error rate with varying β.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F3.png]]<br />
<br />
=== Uncertainty Representation ===<br />
<br />
One of the big advantages of BM is the ability to represent uncertainty about the prediction. The authors evaluate the uncertainty representation on in-distribution (ID) and out-of-distribution (OOD) samples. <br />
<br />
===== ID uncertainty =====<br />
<br />
For ID (in-distribution) samples, calibration performance is measured, which is a measure of how well the model’s confidence matches its actual accuracy. This measure can be visualized using reliability plots and quantified using a metric called expected calibration error (ECE). ECE is calculated by grouping predictions into M groups based on their confidence score and then finding the absolute difference between the average accuracy and average confidence for each group.<br />
The figure below is a reliability plot of ResNet-50 on CIFAR-10 and CIFAR-100 with 15 groups. It shows that BM has a significantly better calibration performance than softmax since the confidence matches the accuracy more closely (this is also reflected in the lower ECE).<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F4.png]]<br />
<br />
===== OOD uncertainty =====<br />
<br />
Here, the authors quantify uncertainty using predictive entropy - the larger the predictive entropy, the larger the uncertainty about a prediction. <br />
<br />
The figure below is a density plot of the predictive entropy of ResNet-50 on CIFAR-10. It shows that BM provides significantly better uncertainty estimation compared to other methods since BM is the only method that has a clear peak of high predictive entropy for OOD samples which should have high uncertainty. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F5.png]]<br />
<br />
=== Transfer learning ===<br />
<br />
Belief matching applies the Bayesian principle outside the neural network, which means it can easily be applied to already trained models. Thus, belief matching can be employed in transfer learning scenarios. The authors downloaded the ImageNet pre-trained ResNet-50 weights and fine-tuned the weights of the last linear layer for 100 epochs using an Adam optimizer.<br />
<br />
This table shows the test error rates from transfer learning on CIFAR-10, Food-101, and Cars datasets. Belief matching consistently performs better than softmax. <br />
<br />
[[File:being_bayesian_about_categorical_probability_transfer_learning.png]]<br />
<br />
Belief matching was also tested for the predictive uncertainty for out of dataset samples based on CIFAR-10 as the in distribution sample. Looking at the figure below, it is observed that belief matching significantly improves the uncertainty representation of pre-trained models by only fine-tuning the last layer’s weights. Note that belief matching confidently predicts examples in Cars since CIFAR-10 contains the object category automobiles. In comparison, softmax produces confident predictions on all datasets. Thus, belief matching could also be used to enhance the uncertainty representation ability of pre-trained models without sacrificing their generalization performance.<br />
<br />
[[File: being_bayesian_about_categorical_probability_transfer_learning_uncertainty.png]]<br />
<br />
=== Semi-Supervised Learning ===<br />
<br />
Belief matching’s ability to allow neural networks to represent rich information in their predictions can be exploited to aid consistency based loss function for semi-supervised learning. Consistency-based loss functions use unlabelled samples to determine where to promote the robustness of predictions based on stochastic perturbations. This can be done by perturbing the inputs (which is the VAT model) or the networks (which is the pi-model). Both methods minimize the divergence between two categorical probabilities under some perturbations, thus belief matching can be used by the following replacements in the loss functions. The hope is that belief matching can provide better prediction consistencies using its Dirichlet distributions.<br />
<br />
[[File: being_bayesian_about_categorical_probability_semi_supervised_equation.png]]<br />
<br />
The results of training on ResNet28-2 with consistency based loss functions on CIFAR-10 are shown in this table. Belief matching does have lower classification error rates compared to using a softmax.<br />
<br />
[[File:being_bayesian_about_categorical_probability_semi_supervised_table.png]]<br />
<br />
== Conclusion ==<br />
<br />
Bayesian principles can be used to construct the target distribution by using the categorical probability as a random variable rather than a training label. This can be applied to neural network models by replacing only the softmax and cross-entropy loss, while improving the generalization performance and uncertainty estimation. <br />
<br />
In the future, the authors would like to allow for more expressive distributions in the belief matching framework, such as logistic normal distributions to capture strong semantic similarities among class labels. Furthermore, using input dependent priors would allow for interesting properties that would aid imbalanced datasets and multi-domain learning.<br />
<br />
== Citations ==<br />
<br />
[1] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Springer, 1990.<br />
<br />
[2] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.<br />
<br />
[3] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.<br />
<br />
[4] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. <br />
<br />
[5] MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448– 472, 1992.<br />
<br />
[6] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011. <br />
<br />
[7] Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1):4873–4907, 2017.<br />
<br />
[8] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference of Machine Learning, 2018.<br />
<br />
[9] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[10] Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[11] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.<br />
<br />
[12] Neumann, L., Zisserman, A., and Vedaldi, A. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2018.<br />
<br />
[13] Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. Disturblabel: Regularizing cnn on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.<br />
<br />
[14] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Efficient_kNN_Classification_with_Different_Numbers_of_Nearest_Neighbors&diff=48923Efficient kNN Classification with Different Numbers of Nearest Neighbors2020-12-02T21:51:41Z<p>T34sun: /* Reconstruction */</p>
<hr />
<div>== Presented by == <br />
Cooper Brooke, Daniel Fagan, Maya Perelman<br />
<br />
== Introduction == <br />
Traditional model-based classification approaches first use training observations to fit a model before predicting test samples. In contrast, the model-free k-nearest neighbors (KNNs) method classifies observations with a majority rules approach, labeling each piece of test data based on its k closest training observations (neighbours). This method has become very popular due to its strong performance and simple implementation. <br />
<br />
There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples, and the second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.<br />
<br />
== Previous Work == <br />
<br />
Previous work on finding an optimal fixed k value for all test samples is well-studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al.[2] explored involved selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] had the solution of selecting a robust k using the consensus of multiple rounds of kNNs. These fixed k methods are valuable however are impractical for data mining and machine learning applications. <br />
<br />
Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahugara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross validation while Sahugara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches such as those taken by Zheng et al. and Góra and Wojna also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.<br />
<br />
== Motivation == <br />
<br />
Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seeks to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.<br />
<br />
A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.<br />
<br />
== Approach == <br />
<br />
<br />
=== kTree Classification ===<br />
<br />
The proposed kTree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_1.png | center | 800x800px]]<br />
<br />
==== Reconstruction ====<br />
<br />
The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}\in \mathbb{R}^{d\times n} = [x_1,...,x_n]</math> represents the training set can be written as:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2<br />
\end{aligned}$$<br />
<br />
In addition, an <math>l_1</math> regularization term multiplied by a tuning parameter, <math>\rho_1</math>, is added to ensure that sparse results are generated as the objective is to minimize the number of training samples that will eventually be depended on by the test samples. <br />
<br />
The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. After some transformations, this second regularization term that has tuning parameter <math>\rho_2</math> is:<br />
<br />
$$\begin{aligned}<br />
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})<br />
\end{aligned}$$<br />
<br />
where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features. The Laplacian matrix, also called the graph Laplacian, is a matrix representation of a graph. <br />
<br />
This gives a final objective function of:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}|| + \rho_2R(\textbf{W})<br />
\end{aligned}$$<br />
<br />
Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.<br />
<br />
==== Calculate ''k'' for training set ====<br />
<br />
Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training sample so if a value is 0, it can be concluded that the jth training sample has no effect on the ith training sample which means that it should not be used in the prediction of the ith training sample. Consequently, all non-zero values in the <math>w_{.j}</math> vector would be useful in predicting the ith training sample which gives the result that the number of these non-zero elements for each sample is equal to the optimal ''k'' value for each sample.<br />
<br />
For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:<br />
<br />
[[File:Approach_Figure_2.png | center | 300x300px]]<br />
<br />
The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 are non-zero.<br />
<br />
==== Train a Decision Tree using ''k'' as the label ====<br />
<br />
In a normal decision tree, the target data is the labels themselves. In contrast, in the kTree method, the target data is the optimal ''k'' value for each sample that was solved for in the previous step. So this decision tree has the following form:<br />
<br />
[[File:Approach_Figure_3.png | center | 300x300px]]<br />
<br />
==== Making Predictions for Test Data ====<br />
<br />
The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbours across '''all''' of the training data.<br />
<br />
=== k*Tree Classification ===<br />
<br />
The proposed k*Tree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_4.png | center | 1000x1000px]]<br />
<br />
Clearly, this is a very similar approach to the kTree as the k*Tree method attempts to sacrifice very little in predictive power in return for a substantial decrease in complexity when actually implementing the traditional kNN on the testing data once the optimal ''k'' values have been found.<br />
<br />
While all steps previous are the exact same, the k*Tree method not only stores the optimal ''k'' value but also the following information:<br />
<br />
* The training samples that have the same optimal ''k''<br />
* The ''k'' nearest neighbours of the previously identified training samples<br />
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours<br />
<br />
The data stored in each node is summarized in the following figure:<br />
<br />
[[File:Approach_Figure_5.png | center | 800x800px]]<br />
<br />
In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.<br />
<br />
== Experiments == <br />
<br />
In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, and contained a mix of data types varying in size, in dimensionality, in the number of classes, and in imbalanced nature of the data. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:<br />
<br />
# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.<br />
# kNN-Based Applicability Domain Approach (AD-kNN) [11]<br />
# kNN Method Based on Sparse Learning (S-kNN) [10]<br />
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]<br />
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]<br />
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]<br />
<br />
The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features. <br />
<br />
<br />
'''A. Experimental Results on Different Sample Sizes'''<br />
<br />
The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.<br />
<br />
[[File:Table_I_kNN.png | center | 1000x1000px]]<br />
<br />
The following key results are noted:<br />
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-KNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.<br />
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.<br />
* The proposed k*Tree method had the lowest running cost of all methods. However, the k*Tree method was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method.<br />
<br />
<br />
'''B. Experimental Results on Different Feature Numbers'''<br />
<br />
The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score, an algorithm that solves maximum likelihood equations numerically [15], was used to rank and select the most information features in the datasets. <br />
<br />
[[File:Table_II_kNN.png | center | 1000x1000px]]<br />
<br />
From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-KNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.<br />
<br />
== Conclusion == <br />
<br />
This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods can classify the test samples efficiently and effectively, by designing a training step that reduces the run time of the test stage and thus enhances the performance. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features.<br />
<br />
== Critiques == <br />
<br />
*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy. <br />
* The authors addressed that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.). <br />
*While the authors contrast their ktTee and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compared to Góra and Wojna varied optimal k method.<br />
<br />
* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.<br />
<br />
* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.<br />
<br />
* It would be really helpful if Ktrees method can be explained at the very beginning. The transition from KNN to Ktrees is not very smooth.<br />
<br />
* It would be nice to have a comparison of the running costs of different methods to see how much cost the kTree and k*Tree reduced.<br />
<br />
* It would be better to show the key result only on a summary rather than stacking up all results without screening.<br />
<br />
* In the results section, it was mentioned that in the experiment on data sets with different numbers of features, the kTree and k*Tree model did not achieve GS-kNN or S-kNN's accuracies, but was faster in terms of running cost. It might be helpful here if the authors add some more supporting arguments about the benefit of this tradeoff, which appears to be a minor decrease in accuracy for a large improvement in speed. This could further showcase the advantages of the kTree and k*Tree models. More quantitative analysis or real-life scenario examples could be some choices here.<br />
<br />
* An interesting thing to notice while solving for the optimal matrix <math>W^*</math> that minimizes the loss function is that <math>W^*</math> is not necessarily a symmetric matrix. That is, the correlation between the <math>i^{th}</math> entry and the <math>j^{th}</math> entry is different from that between the <math>j^{th}</math> entry and the <math>i^{th}</math> entry, which makes the resulting W* not really semantically meaningful. Therefore, it would be interesting if we may set a threshold on the allowing difference between the <math>ij^{th}</math> entry and the <math>ji^{th}</math> entry in <math>W^*</math> and see if this new configuration will give better or worse results compared to current ones, which will provide better insights of the algorithm.<br />
<br />
* It would be interesting to see how the proposed model work with highly non-linear datasets. In the event it does not work well, it would pose the question: would replacing the k*Tree with a SVM or a neural network improve the accuracy? There could be experiments to show if this variant would prove superior over the original models.<br />
<br />
* The key results are a little misleading - for example the claim "the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method" is false. The kTree method had slightly lower accuracy than both GS-kNN and S-kNN and kTree was also slower than LC-kNN<br />
<br />
== References == <br />
<br />
[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.<br />
<br />
[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.<br />
<br />
[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.<br />
<br />
[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.<br />
<br />
[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.<br />
<br />
[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.<br />
<br />
[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.<br />
<br />
[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.<br />
<br />
[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.<br />
<br />
[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.<br />
<br />
[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013. <br />
<br />
[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.<br />
<br />
[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.<br />
<br />
[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.<br />
<br />
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_universal_SNP_and_small-indel_variant_caller_using_deep_neural_networks&diff=48922A universal SNP and small-indel variant caller using deep neural networks2020-12-02T21:49:14Z<p>T34sun: /* Background */</p>
<hr />
<div>== Background ==<br />
<br />
<br />
Biological functions are determined by genes, and differences in function are determined by mutants, or alleles(one of two or more alternative forms of a gene that arise by mutation and are found at the same place on a chromosome), of those genes. Determining novel alleles is very important in understanding the genetic variation within a species. For example, most eye colors are determined by different alleles of the gene OCA2. All animals receive one copy of each gene from each of their parents. Mutations of a gene are classified as either homozygous (both copies are the same) or heterozygous (the two copies are different).<br />
<br />
Next-generation sequencing is a very popular technique for sequencing, or reading, DNA. Since all genes are encoded as DNA, sequencing is an essential tool for understanding genes. Next-generation sequencing works by reading short sections of DNA of length k, called k-means, and then piecing them together or aligning them to a reference genome. Next-generation sequencing is relatively fast and inexpensive, although it can randomly misidentify some nucleotides, introducing errors. However, NGS reading is errorful and arises from a complex error process depending on various factors.<br />
<br />
The process of variant calling is determining novel alleles from sequencing data (typically next-generation sequencing data). Some significant alleles only differ from the “standard” version of a gene by only a single base pair, such as the mutation which causes multiple sclerosis. Therefore it is important to be able to accurately call single nucleotide swaps/polymorphisms (SNPs), insertions, and deletions (indels). Calling SNPs and small indels are technically challenging since it requires a program to be able to distinguish between truly novel mutations and errors in the sequencing data.<br />
<br />
Previous approaches usually involve using various statistical techniques, a widely used one is GATK. However, these methods have their weaknesses as some assumptions simply don't hold (i.e. independence assumptions), and it's hard to generalize them to other sequencing technologies.<br />
<br />
This paper aims to solve the problem of calling SNPs and small indels using a convolutional neural net by casting the reads as images and classifying whether or not they contain a mutation. It introduces a variant caller called "DeepVariant", which requires no specialized knowledge, but performs better than previous state-of-art methods.<br />
<br />
== Overview ==<br />
<br />
In Figure 1, the DeepVariant workflow overview is illustrated.<br />
<br />
[[File:figure 111.JPG|Figure 1. In all panels, blue boxes represent data and red boxes are processes]]<br />
<br />
<br />
Initially, the NGS reads aligned to a reference genome are scanned for candidate variants which are different sites from the reference genome. The read and reference data are encoded as an image for each candidate variant site. Then, trained CNN can compute the genotype likelihoods, (heterozygous or homozygous) for each of the candidate variants (figure1, left box). <br />
To train the CNN for image classification purposes, the DeepVariant machinery makes pileup images for a labeled sample with known genotypes. These labeled images and known genotypes are provided to CNN for training, and a stochastic gradient descent algorithm is used to optimize the CNN parameters to maximize genotype prediction accuracy. After the convergence of the model, the final model is frozen to use for calling mutations for other image classification tests (figure1, middle box).<br />
For example, in figure 1 (right box), the reference and read bases are encoded into a pileup image at a candidate variant site. CNN using this encoded image computes the genotype likelihoods for the three diploid genotype states of homozygous reference (hom-ref), heterozygous (het) or homozygous alternate (hom-alt). In this example, a heterozygous variant call is emitted, as the most probable genotype here is “het”.<br />
<br />
== Preprocessing ==<br />
<br />
Before the sequencing reads can be fed into the classifier, they must be preprocessed. There are many pre-processing steps that are necessary for this algorithm. These steps represent the real novelty in this technique, by transforming the data in a way that allows us to use more common neural network architectures for classification. The preprocessing of the data can be broken into three main phases: the realignment of reads, finding candidate variants and creating images of the candidate variants. <br />
<br />
The realignment of the reads phase of the preprocessing is important in order to ensure the sequences can be properly compared to the reference sequences. First, the sequences are aligned to a reference sequence. Reads that align poorly are grouped with other reads around them to build that section, or haplotype, from scratch. If there is strong evidence that the new version of the haplotype fits the reads well, the reads are re-aligned to it. This process updates the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string, a way to represent the alignment of a sequence to a reference, for each read.<br />
<br />
Once the reads are properly aligned, the algorithm then proceeds to find candidate variants, regions in the DNA sequence that may contain variants. It is these candidate variants that will eventually be passed as input to the neural network. To find these, we need to consider each position in the reference sequence independently. Any unusable reads are filtered at this point. This includes reads that are not aligned properly, ones that are marked as duplicates, those that fail vendor quality checks, or whose mapping quality is less than ten. For each site in the genome, we collect all the remaining reads that overlap that site. The corresponding allele aligned to that site is then determined by decoding the CIGAR string, which was updated in the realignment phase, of each read. The alleles are then classified into one of four categories: reference-matching base, reference-mismatching base, insertion with a specific sequence, or deletion with a specific length, and the number of occurrences of each distinct allele across all reads is counted. Read bases are only included as potential alleles if each base in the allele has a quality score of at least 10.<br />
<br />
With candidate variants identified, the last phase of pre-processing is to convert these candidate variants into images representing the data. This allows for the use of well established convolutional neural networks for image classification for this specialized problem. Each colour channel is used to store a different piece of information about a candidate variant. The red channel encodes which base we have (A, G, C, or T), by mapping each base to a particular value. The quality of the read is mapped to the green colour channel. And finally, the blue channel encodes whether or not the reference is on the positive strand of the DNA. Each row of the image represents a read, and each column represents a particular base in that read. The reference strand is repeated for the first five rows of the encoded image, in order to maintain its information after a 5x5 convolution is applied.<br />
With the data preprocessing complete, the images can then be passed into the neural network for classification.<br />
<br />
== Neural Network ==<br />
<br />
The neural network used is a convolutional neural network. Although the full network architecture is not revealed in the paper, there are several details which we can discuss. The architecture of the network is an input layer attached to an adapted Inception v2 ImageNet model with nine partitions. The input layer takes as input the images representing the candidate variants and rescales them to 299x299 pixels. The output layer is a three-class Softmax layer initialized with Gaussian random weights with a standard deviation of 0.001. This final layer is fully connected to the previous layer. The three classes are the homozygous reference (meaning it is not a variant), heterozygous variant, and homozygous variant. The candidate variant is classified into the class with the highest probability. The model is trained using stochastic gradient descent with a weight decay of 0.00004. The training was done in mini-batches, each with 32 images, using a root mean squared (RMS) decay of 0.9. For the multiple sequencing technologies experiments, a single model was trained with a learning rate of 0.0015 and momentum 0.8 for 250,000 update steps. For all other experiments, multiple models were trained, and the one with the highest accuracy on the training set was chosen as the final model. The multiple models stem from using each combination of the possible parameter values for the learning rate (0.00095, 0.001, 0.0015) and momentum (0.8, 0.85, 0.9). These models were trained for 80 hours, or until the training accuracy converged.<br />
<br />
== Results ==<br />
<br />
DeepVariant was trained using data available from the CEPH (Centre d’Etude du Polymorphism Humain) female sample NA12878 and was evaluated on the unseen Ashkenazi male sample NA24385. The results were compared with other most commonly used bioinformatics methods, such as the GATK, FreeBayes22, SAMtools23, 16GT24 and Strelka25 (Table 1). For better comparison, the overall accuracy (F1), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are illustrated over the whole genome.<br />
<br />
[[File:table 11.JPG]]<br />
<br />
DeepVariant showed the highest accuracy and more than 50% fewer errors per genome compared to the next best algorithm. <br />
<br />
They also evaluated the same set of algorithms using the synthetic diploid sample CHM1-CHM1326 (Table 2).<br />
<br />
[[File:Table 333.JPG]]<br />
<br />
Results illustrated that the DeepVariant method outperformed all other algorithms for variant calling (SNP and indel) and showed the highest accuracy in terms of F1, Recall, precision and TP.<br />
<br />
== Conclusion ==<br />
<br />
This endeavour to further advance a data-centric approach to understanding the gene sequence illustrate the advantages of deep learning over humans. With billions of DNA base pairs, no humans are able to digest that amount of gene expressions. In the past, computational technique are unfeasible due to the lack in compute power but in the 21st century, it seems that machine learning is the way to go for molecular biology.<br />
<br />
DeepVariant’s strong performance on human data proves that deep learning is a promising technique for variant calling. Perhaps the most exciting feature of DeepVariant is its simplicity. Unlike other states of the art variant callers, DeepVariant has no knowledge of the sequencing technologies that create the reads, or even the biological processes that introduce mutations. This simplifies the problem of variant calling to preprocessing the reads and training a generic deep learning model. It also suggests that DeepVariant could be significantly improved by tailoring the preprocessing to specific sequencing technologies and/or developing a dedicated CNN architecture for the reads, rather than trying to cast them as images.<br />
<br />
== Critique and Discussion==<br />
<br />
The paper presents an interesting method for solving an important problem. Building "images" of reads and running them through a generic image classification CNN seems like a strange approach, and it is interesting that it works well. The biggest issues with the paper are the lack of specific information about how the methods. Some extra information is included in the supplementary material, but there are still some big gaps. In particular:<br />
<br />
1. What is the structure of the neural net? How many layers, and what sizes? The paper for ConvNet which is cited does not have this information. We suspect that this might be a trade secret that Google is protecting.<br />
<br />
2. How is the realignment step implemented? The paper mentions that it uses a "De-Bruijn-graph-based read assembly procedure" to realign reads to a new haplotype. This is a non-standard step in most genomics workflows yet the paper does not describe how they do the realignment or how they build the haplotypes.<br />
<br />
3. How did they settle on the image construction algorithm? The authors provide pseudocode for the construction of pileup images but they do not describe how the decisions for made. For instance, the colour values for different base pairs are not evenly spaced. Also, the image begins with 5 rows of the reference genome.<br />
<br />
One thing we appreciated about the paper was their commentary on future developments. The authors make it very clear that this approach can be improved on and provide specific ideas for next steps.<br />
<br />
Overall, the paper presents an interesting idea with strong results, but lacks detail in some key pieces of the implementation.<br />
<br />
The topic of this project is good but we need to more details of the algorithm. In the neural network part, the details are not enough, Authors should provide a figure to better explain how the model works and the structure of the model. Otherwise we cannot understand how the model works.<br />
<br />
4 We probably want more details about how the algorithm was exactly implemented to work for this project. Also, when we are preprocessing the data, if different data have different lengths, shall we add more information or drop some information so they match?<br />
<br />
5 Particularly, which package did the researchers use to perform this analysis? Different packages of deep learning can have different accuracy and efficiency while making predictions on this data set. <br />
<br />
Further studies on DeepVariant [https://www.nature.com/articles/s41598-018-36177-7 have shown] that it is a framework with great potential and sets the standard in the medical genetics field.<br />
<br />
6. It was mentioned that part of the network used is an "adapted Inception v2 ImageNet model" - does this mean that it used an Inception v2 model that was pretrained on ImageNet? This is not clear, but if this is the case then why is this useful? Why would the features that are extracted for an image be useful for genomics? Did they try using a model that wasn't pretrained? Also, they describe the preprocessing method used but were there any alternatives that they considered?<br />
<br />
== References ==<br />
[1] Hartwell, L.H. ''et. al.'' ''Genetics: From Genes to Genomes''. (McGraw-Hill Ryerson, 2014).<br />
<br />
[2] Poplin, R. ''et. al''. A universal SNP and small-indel variant caller using deep neural networks. ''Nature Biotechnology'' '''36''', 983-987 (2018).</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Music_Recommender_System_Based_using_CRNN&diff=48921Music Recommender System Based using CRNN2020-12-02T21:45:54Z<p>T34sun: /* Critiques/ Insights: */</p>
<hr />
<div>==Introduction and Objective:==<br />
<br />
In the digital era of music streaming, companies, such as Spotify and Pandora, are faced with the following challenge: can they provide users with relevant and personalized music recommendations amidst the ever-growing abundance of music and user data.<br />
<br />
The objective of this paper is to implement a personalized music recommender system that takes user listening history as input and continually finds new music that captures individual user preferences.<br />
<br />
This paper argues that a music recommendation system should vary from the general recommendation system used in practice since it should combine music feature recognition and audio processing technologies to extract music features, and combine them with data on user preferences.<br />
<br />
The authors of this paper took a content-based music approach to build the recommendation system - specifically, comparing the similarity of features based on the audio signal.<br />
<br />
The following two-method approach for building the recommendation system was followed:<br />
#Make recommendations including genre information extracted from classification algorithms.<br />
#Make recommendations without genre information.<br />
<br />
The authors used convolutional recurrent neural networks (CRNN), which is a combination of convolutional neural networks (CNN) and recurrent neural network(RNN), as their main classification model.<br />
<br />
==Methods and Techniques:==<br />
Collaborative filtering method is used by many music recommendation systems, which was explicitly based on ratings of a song from users to find similar users or items. However, people tend not to rate a song when listening to music. Nowadays, deep learning techniques are employed to music recommendation systems. Content based recommendation systems use item content to find similarities between items and the recommended items for users. <br />
<br />
Generally, a music recommender can be divided into three main parts: (I) users, (ii) items, and (iii) user-item matching algorithms. First, we generated users' music tastes based on their profiles. Second, item profiling includes editorial, cultural, and acoustic metadata and they were collected for listeners' satisfaction. Finally, we come to the matching algorithm that suggests recommended personalized music to listeners. <br />
<br />
To classify music, the original music’s audio signal is converted into a spectrogram image. Using the image and the Short Time Fourier Transform (STFT), we convert the data into the Mel scale which is used in the CNN and CRNN models. <br />
=== Mel Scale: === <br />
The scale of pitches that are heard by listeners, which translates to equal pitch increments.<br />
<br />
[[File:Mel.png|frame|none|Mel Scale on Spectrogram]]<br />
<br />
=== Short Time Fourier Transform (STFT): ===<br />
The transformation that determines the sinusoidal frequency of the audio, with a Hanning smoothing function. In the continuous case this is written as: <math>\mathbf{STFT}\{x(t)\}(\tau,\omega) \equiv X(\tau, \omega) = \int_{-\infty}^{\infty} x(t) w(t-\tau) e^{-i \omega t} \, d t </math><br />
<br />
where: <math>w(\tau)</math> is the Hanning smoothing function<br />
<br />
=== Convolutional Neural Network (CNN): ===<br />
A Convolutional Neural Network is a Neural Network that uses convolution in place of matrix multiplication for some layer calculations. By training the data, weights for inputs are updated to find the most significant data relevant to classification. These convolutional layers gather small groups of data with kernels and try to find patterns that can help find features in the overall data. The features are then used for classification. Padding is another technique used to extend the pixels on the edge of the original image to allow the kernel to more accurately capture the borderline pixels. Padding is also used if one wishes the convolved output image to have a certain size. The image on the left represents the mathematical expression of a convolution operation, while the right image demonstrates an application of a kernel on the data.<br />
<br />
[[File:Convolution.png|thumb|400px|left|Convolution Operation]]<br />
[[File:PaddingKernels.png|thumb|400px|center|Example of Padding (white 0s) and Kernels (blue square)]]<br />
<br />
=== Convolutional Recurrent Neural Network (CRNN): === <br />
Similar Neural Network as CNN, with the addition of a GRU, which is a Recurrent Neural Network (RNN). An RNN is used to treat sequential data, by reusing the activation function of previous nodes to update the output. A Gated Recurrent Unit (GRU) is used to store more long-term memory and will help train the early hidden layers.<br />
<br />
[[File:GRU441.png|thumb|400px|left|Gated Recurrent Unit (GRU)]]<br />
[[File:Recurrent441.png|thumb|400px|center|Diagram of General Recurrent Neural Network]]<br />
<br />
==Data Screening:==<br />
<br />
The authors of this paper used a publicly available music dataset made up of 25,000 30-second songs from the Free Music Archives which contain 16 different genres. The data is cleaned up by removing low audio quality songs, wrong labelled genre and those that has multiple genres. To ensure a balanced dataset, only 1000 songs each from the genres of classical, electronic, folk, hip-hop, instrumental, jazz and rock were used in the final model. <br />
<br />
[[File:Data441.png|thumb|200px|none|Data sorted by music genre]]<br />
<br />
==Implementation:==<br />
<br />
=== Modeling Neural Networks ===<br />
<br />
As noted previously, both CNNs and CRNNs were used to model the data. The advantage of CRNNs is that they are able to model time sequence patterns in addition to frequency features from the spectrogram, allowing for greater identification of important features. Furthermore, feature vectors produced before the classification stage could be used to improve accuracy. <br />
<br />
In implementing the neural networks, the Mel-spectrogram data was split up into training, validation, and test sets at a ratio of 8:1:1 respectively and labelled via one-hot encoding. This made it possible for the categorical data to be labelled correctly for binary classification. As opposed to classical stochastic gradient descent, the authors opted to use Adam optimization to update weights in the training phase. Binary cross-entropy was used as the loss function. <br />
<br />
In both the CNN and CRNN models, the data was trained over 100 epochs with a binary cross-entropy loss function. Notable model specific details are below:<br />
<br />
'''CNN'''<br />
* Five convolutional layers with 3x3 kernel, stride 1, padding, batch normalization, and ReLU activation<br />
* Max pooling layers <br />
* The sigmoid function was used as the output layer<br />
<br />
'''CRNN'''<br />
* Four convolutional layers with 3x3 kernel, stride 1, padding, batch normalization, ReLU activation, and dropout rate 0.1<br />
* Max pooling layers <br />
* Two GRU layers<br />
* The sigmoid function was used as the output layer<br />
<br />
The CNN and CRNN architecture is also given in the charts below.<br />
<br />
[[File:CNN441.png|thumb|800px|none|Implementation of CNN Model]]<br />
[[File:CRNN441.png|thumb|800px|none|Implementation of CRNN Model]]<br />
<br />
=== Music Recommendation System ===<br />
<br />
The recommendation system is computed by the cosine similarity of the extraction features from the neural network. Each genre will have a song act as a centre point for each class. The final inputs of the trained neural networks will be the feature variables. The featured variables will be used in the cosine similarity to find the best recommendations. <br />
<br />
The values are between [-1,1], where larger values are songs that have similar features. When the user inputs five songs, those songs become the new inputs in the neural networks and the features are used by the cosine similarity with other music. The largest five cosine similarities are used as recommendations.<br />
[[File:Cosine441.png|frame|100px|none|Cosine Similarity]]<br />
<br />
== Evaluation Metrics ==<br />
=== Precision: ===<br />
* The proportion of True Positives with respect to the '''predicted''' positive cases (true positives and false positives)<br />
* For example, out of all the songs that the classifier '''predicted''' as Classical, how many are actually Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among those predicted to be of that certain genre<br />
<br />
=== Recall: ===<br />
* The proportion of True Positives with respect to the '''actual''' positive cases (true positives and false negatives)<br />
* For example, out of all the songs that are '''actually''' Classical, how many are correctly predicted to be Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among the correct instances of that genre<br />
<br />
=== F1-Score: ===<br />
An accuracy metric that combines the classifier’s precision and recall scores by taking the harmonic mean between the two metrics:<br />
<br />
[[File:F1441.png|frame|100px|none|F1-Score]]<br />
<br />
=== Receiver operating characteristics (ROC): ===<br />
* A graphical metric that is used to assess a classification model at different classification thresholds <br />
* In the case of a classification threshold of 0.5, this means that if <math>P(Y = k | X = x) > 0.5</math> then we classify this instance as class k<br />
* Plots the true positive rate versus false positive rate as the classification threshold is varied<br />
<br />
[[File:ROCGraph.jpg|thumb|400px|none|ROC Graph. Comparison of True Positive Rate and False Positive Rate]]<br />
<br />
=== Area Under the Curve (AUC) ===<br />
AUC is the area under the ROC in doing so, the ROC provides an aggregate measure across all possible classification thresholds.<br />
<br />
In the context of the paper: When scoring all songs as <math>Prob(Classical | X=x)</math>, it is the probability that the model ranks a random Classical song at a higher probability than a random non-Classical song.<br />
<br />
[[File:AUCGraph.jpg|thumb|400px|none|Area under the ROC curve.]]<br />
<br />
== Results ==<br />
=== Accuracy Metrics ===<br />
The table below is the accuracy metrics with the classification threshold of 0.5.<br />
<br />
[[File:TruePositiveChart.jpg|thumb|none|True Positive / False Positive Chart]]<br />
On average, CRNN outperforms CNN in true positive and false positive cases. In addition, it is very apparent that false positives are much more frequent for songs in the Instrumental genre, perhaps indicating that more pre-processing needs to be done for songs in this genre or that it should be excluded from analysis completely given how most music has instrumental components.<br />
<br />
<br />
[[File:F1Chart441.jpg|thumb|400px|none|F1 Chart]]<br />
On average, CRNN outperforms CNN in F1-score. <br />
<br />
<br />
[[File:AUCChart.jpg|thumb|400px|none|AUC Chart]]<br />
On average, CRNN also outperforms CNN in AUC metric.<br />
<br />
<br />
CRNN models that consider the frequency features and time sequence patterns of songs have a better classification performance through metrics such as F1 score and AUC when comparing to CNN classifier.<br />
<br />
=== Evaluation of Music Recommendation System: ===<br />
<br />
* A listening experiment was performed with 30 participants to access user responses to given music recommendations.<br />
* Participants choose 5 pieces of music they enjoyed and the recommender system generated 5 new recommendations. The participants then evaluated the music recommendation by recording whether the song was liked or disliked.<br />
* The recommendation system takes two approaches to the recommendation:<br />
** Method one uses only the value of cosine similarity.<br />
** Method two uses the value of cosine similarity and information on music genre.<br />
*Perform test of significance of differences in average user likes between the two methods using a t-statistic:<br />
[[File:H0441.png|frame|100px|none|Hypothesis test between method 1 and method 2]]<br />
<br />
Comparing the two methods, <math> H_0: u_1 - u_2 = 0</math>, we have <math> t_{stat} = -4.743 < -2.037 </math>, which demonstrates that the increase in average user likes with the addition of music genre information is statistically significant.<br />
<br />
== Conclusion: ==<br />
<br />
Here are two main conclusions obtained from this paper:<br />
<br />
- To increase the predictive capabilities of the music recommendation system, the music genre should be a key feature to analyze.<br />
<br />
- To extract the song genre from a song’s audio signals and get overall better performance, CRNN’s are superior to CNN’s as they consider frequency in features and time sequence patterns of audio signals. <br />
<br />
According to analyses in the paper, the authors also suggested adding other music features like tempo gram for capturing local tempo to improve the accuracy of the recommender system.<br />
<br />
== Critiques/ Insights: ==<br />
# The authors fail to give reference to the performance of current recommendation algorithms used in the industry; my critique would be for the authors to bench-mark their novel approach with other recommendation algorithms such as collaborative filtering to see if there is a lift in predictive capabilities.<br />
# The listening experiment used to evaluate the recommendation system only includes songs that are outputted by the model. Users may be biased if they believe all songs have come from a recommendation system. To remove bias, we suggest having 15 songs where 5 songs are recommended and 10 songs are set. With this in the user’s mind, it may remove some bias in response and give more accurate predictive capabilities. <br />
# They could go into more details about how CRNN makes it perform better than CNN, in terms of attributes of each network.<br />
# The methodology introduced in this paper is probably also suitable for movie recommendations. As music is presented as spectrograms (images) in a time sequence, and it is very similar to a movie. <br />
# The way of evaluation is a very interesting approach. Since it's usually not easy to evaluate the testing result when it's subjective. By listing all these evaluations' performance, the result would be more comprehensive. A practice that might reduce bias is by coming back to the participants after a couple of days and asking whether they liked the music that was recommended. Often times music "grows" on people and their opinion of a new song may change after some time has passed. <br />
# The paper lacks the comparison between the proposed algorithm and the music recommendation algorithms being used now. It will be clearer to show the superiority of this algorithm.<br />
# The GAN neural network has been proposed to enhance the performance of the neural network, so an improved result may appear after considering using GAN.<br />
# The limitation of CNN and CRNN could be that they are only able to process the spectrograms with single labels rather than multiple labels. This is far from enough for the music recommender systems in today's music industry since the edges between various genres are blurred.<br />
# Is it possible for CNN and CRNN to identify different songs? The model would be harder to train, based on my experience, the efficiency of CNN in R is not very high, which can be improved for future work.<br />
# according to the author, the recommender system is done by calculating the cosine similarity of extraction features from one music to another music. Is possible to represent it by Euclidean distance or p-norm distances?<br />
# In real-life application, most of the music software will have the ability to recommend music to the listener and ask do they like the music that was recommended. It would be a nice application by involving some new information from the listener.<br />
# This paper is very similar to another [https://link.springer.com/chapter/10.1007/978-3-319-46131-1_29 paper], written by Bruce Fewerda and Markus Schedl. Both papers are suggesting methods of building music recommendation systems. However, this paper recommends music based on genre, but the paper written by Fewerda and Schedl suggests a personality-based user modeling for music recommender systems.<br />
# Actual music listeners do not listen to one genre of music, and in fact listening to the same track or the same genre would be somewhat unusual. Could this method be used to make recommendations not on genre, but based on other categories? (Such as the theme of the lyrics, the pitch of the singer, or the date published). Would this model be able to differentiate between tracks of varying "lyric vocabulation difficulty"? Or would NLP algorithms be needed to consider lyrics?<br />
# This model can be applied to many other fields such as recommending the news in the news app, recommending things to buy in the amazon, recommending videos to watch in YOUTUBE and so on based on the user information.<br />
# Looks like for the most genres, CRNN outperforms CNN, but CNN did do better on a few genres (like Jazz), so it might be better to mix them together or might use CNN for some genres and CRNN for the rest.<br />
# Cosine similarity is used to find songs with similar patterns as the input ones from users. That is, feature variables are extracted from the trained neural network model before the classification layer, and used as the basis to find similar songs. One potential problem of this approach is that if the neural network classifies an input song incorrectly, the extracted feature vector will not be a good representation of the input song. Thus, a song that is in fact really similar to the input song may have a small cosine similarity value, i.e. not be recommended. In conclusion, if the first classification is wrong, future inferences based on that is going to make it deviate further from the true answer. A possible future improvement will be how to offset this inference error.<br />
# In the tables when comparing performance and accuracies of the CNN and CRNN models on different genres of music, the researchers claimed that CRNN had superior performance to CNN models. This seemed intuitive, especially in the cases when the differences in accuracies were large. However, maybe the researchers should consider including some hypothesis testing statistics in such tables, which would support such claims in a more rigorous manner.<br />
# A music recommender system that doesn't use the song's meta data such as artist and genre and rather tries to classify genre itself seems unproductive. I also believe that the specific artist matters much more than the genre since within a genre you have many different styles. It just seems like the authors hamstring their recommender system by excluding other relevant data.<br />
<br />
# This summary is well organized with detailed explanation to the music recommendation algorithm. However, since the data used in this paper is cleaned to buffer the efficiency of the recommendation, there should be a section evaluating the impact of noise on the performance this algorithm and how to minimize the impact.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan&diff=47363User:Yktan2020-11-28T20:17:31Z<p>T34sun: /* Motivation */</p>
<hr />
<div><br />
== Introduction ==<br />
<br />
Much of the success in training deep neural networks (DNNs) is thanks to the collection of large datasets with human-annotated labels. However, human annotation is a both time-consuming and expensive task, especially for the data that requires expertise such as medical data. Furthermore, certain datasets will be noisy due to the biases introduced by different annotators.<br />
<br />
There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. Other LNL methods estimate a noise transition matrix and employ it to correct the loss function. An example of a popular loss correction approach is the bootstrapping loss approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples.<br />
<br />
This paper introduces DivideMix, which combines approaches from LNL and SSL. One unique thing about DivideMix is that it discards sample labels that are highly likely to be noisy and leverages these noisy samples as unlabeled data instead. This prevents the model from overfitting and improves generalization performance. Key contributions of this work are:<br />
1) Co-divide, which trains two networks simultaneously, aims to improve generalization and avoiding confirmation bias.<br />
2) During the SSL phase, an improvement is made on an existing method (MixMatch) by combining it with another method (MixUp).<br />
3) Significant improvements to state-of-the-art results on multiple conditions are experimentally shown while using DivideMix. Extensive ablation study and qualitative results are also shown to examine the effect of different components.<br />
<br />
== Motivation ==<br />
<br />
While much has been achieved in training DNNs with noisy labels and SSL methods individually, not much progress has been made in exploring their underlying connections and building on top of the two approaches simultaneously. <br />
<br />
Existing LNL methods aim to correct the loss function by:<br />
<ol><br />
<li> Treating all samples equally and correcting loss explicitly or implicitly through relabeling of the noisy samples<br />
<li> Reweighting training samples or separating clean and noisy samples, which results in correction of the loss function<br />
</ol><br />
<br />
A few examples of LNL methods include:<br />
<ol><br />
<li> Estimating the noise transition matrix, which denote the probability of clean labels flipping to noisy labels, to correct the loss function<br />
<li> Leveraging DNNs to correct labels and using them to modify the loss<br />
<li> Reweighting samples so that noisy labels contribute less to the loss<br />
</ol><br />
<br />
However, these methods each have some downsides. For example, it is very challenging to correctly estimate the noise transition matrix in the first method; for the second method, DNNs tend to overfit to datasets with high noise ratio; for the third method, we need to be able to identify clean samples, which has also proven to be challenging.<br />
<br />
On the other hand, SSL methods mostly leverage unlabeled data using regularization to improve model performance. A recently proposed method, MixMatch incorporates the two classes of regularization – consistency regularization which enforces the model to produce consistent predictions on augmented input data, and entropy minimization which encourages the model to give high-confidence predictions on unlabeled data, as well as MixUp regularization. <br />
<br />
DivideMix partially adopts LNL in that it removes the labels that are highly likely to be noisy by using co-divide to avoid the confirmation bias problem. It then utilizes the noisy samples as unlabeled data and adopts an improved version of MixMatch (SSL) which accounts for the label noise during the label co-refinement and co-guessing phase. By incorporating SSL techniques into LNL and taking the best of both worlds, DivideMix aims to produce highly promising results in training DNNs by better addressing the confirmation bias problem, more accurately distinguishing and utilizing noisy samples, and performing well under high levels of noise.<br />
<br />
== Model Architecture ==<br />
<br />
DivideMix leverages semi-supervised learning to achieve effective modeling. The sample is first split into a labeled set and an unlabeled set. This is achieved by fitting a Gaussian Mixture Model as a per-sample loss distribution. The unlabeled set is made up of data points with discarded labels deemed noisy. Then, to avoid confirmation bias, which is typical when a model is self-training, two models are being trained simultaneously to filter error for each other. This is done by dividing the data using one model and then training the other model. This algorithm, known as Co-divide, keeps the two networks from converging when training, which avoids the bias from occurring. Figure 1 describes the algorithm in graphical form.<br />
<br />
[[File:ModelArchitecture.PNG | center]]<br />
<br />
<div align="center">Figure 1: Model Architecture of DivideMix</div><br />
<br />
For each epoch, the network divides the dataset into a labeled set consisting of clean data, and an unlabeled set consisting of noisy data, which is then used as training data for the other network, where training is done in mini-batches. For each batch of the labelled samples, co-refinement is performed by using the ground truth label <math> y_b </math>, the predicted label <math> p_b </math>, and the posterior is used as the weight, <math> w_b </math>. <br />
<br />
<center><math> \bar{y}_b = w_b y_b + (1-w_b) p_b </math></center> <br />
<br />
Then, a sharpening function is implemented on this weighted sum to produce the estimate, <math> \hat{y}_b </math> Using all these predicted labels, the unlabeled samples will then be assigned a "co-guessed" label, which should produce a more accurate prediction.<br />
<math> \hat{y}_b=Sharpen(\bar{y}_b,T)={\bar{y}^{c{\frac{1}{T}}}_b}/{\sum_{c=1}^C\bar{y}^{c{\frac{1}{T}}}_b} </math>, for c = 1, 2,....., C.<br />
Having calculated all these labels, MixMatch is applied to the combined mini-batch of labeled, <math> \hat{X} </math> and unlabeled data, <math> \hat{U} </math>, where, for a pair of samples and their labels, one new sample and new label is produced. More specifically, for a pair of samples <math> (x_1,x_2) </math> and their labels <math> (p_1,p_2) </math>, the mixed sample <math> (x',p') </math> is:<br />
<br />
<center><br />
<math><br />
\begin{alignat}{2}<br />
<br />
\lambda &\sim Beta(\alpha, \alpha) \\<br />
\lambda ' &= max(\lambda, 1 - \lambda) \\<br />
x' &= \lambda ' x_1 + (1 - \lambda ' ) x_2 \\<br />
p' &= \lambda ' p_1 + (1 - \lambda ' ) p_2 \\<br />
<br />
\end{alignat}<br />
</math><br />
</center> <br />
<br />
MixMatch transforms <math> \hat{X} </math> and <math> \hat{U} </math> into <math> X' </math> and <math> U' </math>. Then, the loss on <math> X' </math>, <math> L_X </math> (Cross-entropy loss) and the loss on <math> U' </math>, <math> L_U </math> (Mean Squared Error) are calculated. A regularization term, <math> L_{reg} </math>, is introduced to regularize the model's average output across all samples in the mini-batch. Then, the total loss is calculated as:<br />
<br />
<center><math> L = L_X + \lambda_u L_U + \lambda_r L_{reg} </math></center> <br />
<br />
where <math> \lambda_r </math> is set to 1, and <math> \lambda_u </math> is used to control the unsupervised loss.<br />
<br />
Lastly, the stochastic gradient descent formula is updated with the calculated loss, <math> L </math>, and the estimated parameters, <math> \boldsymbol{ \theta } </math>.<br />
<br />
== Results ==<br />
'''Applications'''<br />
<br />
The method was validated using four benchmark datasets: CIFAR-10, CIFAR100 (Krizhevsky & Hinton, 2009)(both contain 50K training images and 10K test images of size 32 × 32), Clothing1M (Xiao et al., 2015), and WebVision (Li et al., 2017a).<br />
Two types of label noise are used in the experiments: symmetric and asymmetric.<br />
An 18-layer PreAct Resnet (He et al., 2016) is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 300 epochs. The initial learning rate was set to 0.02, and reduced by a factor of 10 after 150 epochs. Before applying the Co-divide and MixMatch strategies, the models were first independently trained over the entire dataset using cross-entropy loss during a "warm-up" period. Initially, training the models in this way prepares a more regular distribution of losses to improve upon in subsequent epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100. For all CIFAR experiments, we use the same hyperparameters M = 2, T = 0.5, and α = 4. τ is set as 0.5 except for 90% noise ratio when it is set as 0.6.<br />
<br />
<br />
'''Comparison of State-of-the-Art Methods'''<br />
<br />
The effectiveness of DivideMix was shown by comparing the test accuracy with the most recent state-of-the-art methods: <br />
Meta-Learning (Li et al., 2019) proposes a gradient-based method to find model parameters that are more noise-tolerant; <br />
Joint-Optim (Tanaka et al., 2018) and P-correction (Yi & Wu, 2019) jointly optimize the sample labels and the network parameters;<br />
M-correction (Arazo et al., 2019) models sample loss with BMM and apply MixUp.<br />
The following are the results on CIFAR-10 and CIFAR-100 with different levels of symmetric label noise ranging from 20% to 90%. Both the best test accuracy across all epochs and the averaged test accuracy over the last 10 epochs were recorded in the following table:<br />
<br />
<br />
[[File:divideMixtable1.PNG | center]]<br />
<br />
From table1, the author noticed that none of these methods can consistently outperform others across different datasets. M-correction excels at symmetric noise, whereas Meta-Learning performs better for asymmetric noise. DivideMix outperforms state-of-the-art methods by a large margin across all noise ratios. The improvement is substantial (∼10% of accuracy) for the more challenging CIFAR-100 with high noise ratios.<br />
<br />
DivideMix was compared with the state-of-the-art methods with the other two datasets: Clothing1M and WebVision. It also shows that DivideMix consistently outperforms state-of-the-art methods across all datasets with different types of label noise. For WebVision, DivideMix achieves more than 12% improvement in top-1 accuracy. <br />
<br />
<br />
'''Ablation Study'''<br />
<br />
The effect of removing different components to provide insights into what makes DivideMix successful. We analyze the results in Table 5 as follows.<br />
<br />
<br />
[[File:DivideMixtable5.PNG | center]]<br />
<br />
The authors find that both label refinement and input augmentation are beneficial for DivideMix. ''Label refinement'' is important for high noise ratio due because samples that are more noisy would be incorrectly divided into the labeled set. ''Augmentation'' upgrades model performance by creating more reliable predictions and by achieving consistency regularization. In addition, the performance drop seen in the ''DivideMix w/o co-training'' highlights the disadvantage of self-training; the model still has dataset division, label refinement and label guessing, but they are all performed by the same model.<br />
<br />
== Conclusion ==<br />
<br />
This paper provides a new and effective algorithm for learning with noisy labels by leveraging SSL. The DivideMix method trains two networks simultaneously and utilizes co-guessing and co-labeling effectively, therefore it is a robust approach to dealing with noise in datasets. DivideMix has also been tested using various datasets with the results consistently being one of the best when compared to other advanced methods.<br />
<br />
Future work of DivideMix is to create an adaptation for other applications such as Natural Language Processing, and incorporating the ideas of SSL and LNL into DivideMix architecture.<br />
<br />
== Critiques/ Insights ==<br />
<br />
1. While combining both models makes the result better, the author did not show the relative time increase using this new combined methodology, which is very crucial considering training a large amount of data, especially for images. In addition, it seems that the author did not perform much on hyperparameters tuning for the combined model.<br />
<br />
2. There is an interesting insight, which is when noise ratio increases from 80% to 90%, the accuracy of DivideMix drops dramatically in both datasets.<br />
<br />
3. There should be further explanation on why the learning rate drops by a factor of 10 after 150 epochs.<br />
<br />
4. It would be interesting to see the effectiveness of this method on other domains such as NLP. I am not aware of noisy training datasets available in NLP, but surely this is an important area to focus on, as much of the available data is collected from noisy sources from the web.<br />
<br />
5. The paper implicitly assumes that a Gaussian mixture model (GMM) is sufficiently capable of identifying noise. Given the nature of a GMM, it would work well for noise that are distributed by a Gaussian distribution but for all other noise, it would probably be only asymptotic. The paper should present theoretical results on noise that are Exponential, Rayleigh, etc. This is particularly important because the experiments were done on massive datasets, but they do not directly address the case when there are not many data points. <br />
<br />
6. Comparing the training result on these benchmark datasets makes the algorithm quite comprehensive. This is a very insightful idea to maintain two networks to avoid bias from occurring.<br />
<br />
== References ==<br />
Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Unsupervised<br />
label noise modeling and loss correction. In ICML, pp. 312–321, 2019.<br />
<br />
David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin<br />
Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019.<br />
<br />
Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach<br />
to learning from noisy labels. In WACV, pp. 1215–1224, 2018.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN&diff=47354Neural Speed Reading via Skim-RNN2020-11-28T20:11:46Z<p>T34sun: /* Introduction */</p>
<hr />
<div>== Group ==<br />
<br />
Mingyan Dai, Jerry Huang, Daniel Jiang<br />
<br />
== Introduction ==<br />
<br />
Recurrent Neural Network (RNN) is the connection between artificial neural network nodes forming a directed graph along with time series and has time dynamic behavior. RNN is derived from a feedforward neural network and can use its memory to process variable-length input sequences. This makes it suitable for tasks such as unsegmented, connected handwriting recognition, and speech recognition.<br />
<br />
In Natural Language Processing, recurrent neural networks (RNNs) are a common architecture used to sequentially ‘read’ input tokens and output a distributed representation for each token. By recurrently updating the hidden state of the neural network, an RNN can inherently require the same computational cost across time. However, when it comes to processing input tokens, it is usually the case that some tokens are less important to the overall representation of a piece of text or a query when compared to others. In particular, when considering question answering, many times the neural network will encounter parts of a passage that are irrelevant when it comes to answering a query that is being asked.<br />
<br />
== Model ==<br />
<br />
In this paper, the authors introduce a model called 'skim-RNN', which takes advantage of ‘skimming’ less important tokens or pieces of text rather than ‘skipping’ them entirely. This models the human ability to skim through passages, or to spend less time reading parts that do not affect the reader’s main objective. While this leads to a loss in the comprehension rate of the text [1], it greatly reduces the amount of time spent reading by not focusing on areas that will not significantly affect efficiency when it comes to the reader's objective.<br />
<br />
'Skim-RNN' works by rapidly determining the significance of each input and spending less time processing unimportant input tokens by using a smaller RNN to update only a fraction of the hidden state. When the decision is to ‘fully read’, that is to not skim the text, Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function (‘skim’ or ‘read’) is non-differentiable, the authors use a gumbel-softmax [2] to estimate the gradient of the function, rather than traditional methods such as REINFORCE (policy gradient)[3]. The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of float operations (Flop reduction, or Flop-R). When the skimming rate is high, which often leads to faster inference on CPUs, which makes it very useful for large-scale products and small devices.<br />
<br />
The Skim-RNN has the same input and output interfaces as standard RNNs, so it can be conveniently used to speed up RNNs in existing models. In addition, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting a parameter for the threshold for the ‘skim’ decision.<br />
<br />
=== Related Works ===<br />
<br />
As the popularity of neural networks has grown, significant attention has been given to make them faster and lighter. In particular, relevant work focused on reducing the computational cost of recurrent neural networks has been carried out by several other related works. For example, LSTM-Jump (You et al., 2017) [8] models aim to speed up run times by skipping certain input tokens, as opposed to skimming them. Choi et al. (2017)[9] proposed a model which uses a CNN-based sentence classifier to determine the most relevant sentence(s) to the question and then uses an RNN-based question-answering model. This model focuses on reducing GPU run-times (as opposed to Skim-RNN which focuses on minimizing CPU-time or Flop), and is also focused only on question answering. <br />
<br />
=== Implementation ===<br />
<br />
A Skim-RNN consists of two RNN cells, a default (big) RNN cell of hidden state size <math>d</math> and small RNN cell of hidden state size <math>d'</math>, where <math>d</math> and <math>d'</math> are parameters defined by the user and <math>d' \ll d</math>. This follows the fact that there should be a small RNN cell defined for when text is meant to be skimmed and a larger one for when the text should be processed as normal.<br />
<br />
Each RNN cell will have its own set of weights and bias as well as be any variant of an RNN. There is no requirement on how the RNN itself is structured, rather the core concept is to allow the model to dynamically make a decision as to which cell to use when processing input tokens. Note that skipping text can be incorporated by setting <math>d'</math> to 0, which means that when the input token is deemed irrelevant to a query or classification task, nothing about the information in the token is retained within the model.<br />
<br />
Experimental results suggest that this model is faster than using a single large RNN to process all input tokens, as the smaller RNN requires fewer floating-point operations to process the token. Additionally, higher accuracy and computational efficiency are achieved. <br />
<br />
==== Inference ====<br />
<br />
At each time step <math>t</math>, the Skim-RNN unit takes in an input <math>{\bf x}_t \in \mathbb{R}^d</math> as well as the previous hidden state <math>{\bf h}_{t-1} \in \mathbb{R}^d</math> and outputs the new state <math>{\bf h}_t </math> (although the dimensions of the hidden state and input are the same, this process holds for different sizes as well). In the Skim-RNN, there is a hard decision that needs to be made whether to read or skim the input, although there could be potential to include options for multiple levels of skimming.<br />
<br />
The decision to read or skim is done using a multinomial random variable <math>Q_t</math> over the probability distribution of choices <math>{\bf p}_t</math>, where<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>{\bf p}_t = \text{softmax}(\alpha({\bf x}_t, {\bf h}_{t-1})) = \text{softmax}({\bf W}[{\bf x}_t; {\bf h}_{t-1}]+{\bf b}) \in \mathbb{R}^k</math><br />
</div><br />
<br />
where <math>{\bf W} \in \mathbb{R}^{k \times 2d}</math>, <math>{\bf b} \in \mathbb{R}^{k}</math> are weights to be learned and <math>[{\bf x}_t; {\bf h}_{t-1}] \in \mathbb{R}^{2d}</math> indicates the row concatenation of the two vectors. In this case <math> \alpha </math> can have any form as long as the complexity of calculating it is less than <math> O(d^2)</math>. Letting <math>{\bf p}^1_t</math> indicate the probability for fully reading and <math>{\bf p}^2_t</math> indicate the probability for skimming the input at time <math> t</math>, it follows that the decision to read or skim can be modelled using a random variable <math> Q_t</math> by sampling from the distribution <math>{\bf p}_t</math> and<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>Q_t \sim \text{Multinomial}({\bf p}_t)</math><br />
</div><br />
<br />
Without loss of generality, we can define <math> Q_t = 1</math> to indicate that the input will be read while <math> Q_t = 2</math> indicates that it will be skimmed. Reading requires applying the full RNN on the input as well as the previous hidden state to modify the entire hidden state while skimming only modifies part of the prior hidden state.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \begin{cases}<br />
f({\bf x}_t, {\bf h}_{t-1}) & Q_t = 1\\<br />
[f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)] & Q_t = 2<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
where <math> f </math> is a full RNN with output of dimension <math>d</math> and <math>f'</math> is a smaller RNN with <math>d'</math>-dimensional output. This has advantage that when the model decides to skim, then the computational complexity of that step is only <math>O(d'd)</math>, which is much smaller than <math>O(d^2)</math> due to previously defining <math> d' \ll d</math>.<br />
<br />
==== Training ====<br />
<br />
Since the expected loss/error of the model is a random variable that depends on the sequence of random variables <math> \{Q_t\} </math>, the loss is minimized with respect to the distribution of the variables. Defining the loss to be minimized while conditioning on a particular sequence of decisions<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L(\theta\vert Q)<br />
</math><br />
</div><br />
where <math>Q=Q_1\dots Q_T</math> is a sequence of decisions of length <math>T</math>, then the expected loss o ver the distribution of the sequence of decisions is<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
\mathbb{E}[L(\theta)] = \sum_{Q} L(\theta\vert Q)P(Q) = \sum_Q L(\theta\vert Q) \Pi_j {\bf p}_j^{Q_j}<br />
</math><br />
</div><br />
<br />
Since calculating <math>\delta \mathbb{E}_{Q_t}[L(\theta)]</math> directly is rather infeasible, it is possible to approximate the gradients with a gumbel-softmax distribution [2]. Reparameterizing <math> {\bf p}_t</math> as <math> {\bf r}_t</math>, then the back-propagation can flow to <math> {\bf p}_t</math> without being blocked by <math> Q_t</math> and the approximation can arbitrarily approach <math> Q_t</math> by controlling the parameters. The reparameterized distribution is therefore<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf r}_t^i = \frac{\text{exp}(\log({\bf p}_t^i + {g_t}^i)/\tau)}{\sum_j\text{exp}(\log({\bf p}_t^j + {g_t}^j)/\tau)}<br />
</math><br />
</div><br />
<br />
where <math>{g_t}^i</math> is an independent sample from a <math>\text{Gumbel}(0, 1) = -\log(-\log(\text{Uniform}(0, 1))</math> random variable and <math>\tau</math> is a parameter that represents a temperature. Then it can be rewritten that<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t<br />
</math><br />
</div><br />
<br />
where <math>{\bf \tilde{h}}_t</math> is the previous equation for <math>{\bf h}_t</math>. The temperature parameter gradually decreases with time, and <math>{\bf r}_t^i</math> becomes more discrete as it approaches 0.<br />
<br />
A final addition to the model is to encourage skimming when possible. Therefore an extra term related to the negative log probability of skimming and the sequence length. Therefore the final loss function used for the model is denoted by <br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L'(\theta) =L(\theta) + \gamma \cdot\frac{1}{T} \sum_i -\log({\bf \tilde{p}}^i_t)<br />
</math><br />
</div><br />
where <math> \gamma </math> is a parameter used to control the ratio between the main loss function and the negative log probability of skimming.<br />
<br />
== Experiment ==<br />
<br />
The effectiveness of Skim-RNN was measured in terms of accuracy and float operation reduction on four classification tasks and a question-answering task. These tasks were chosen because they do not require one’s full attention to every detail of the text, but rather ask for capturing the high-level information (classification) or focusing on a specific portion (QA) of the text, which a common context for speed reading. The tasks themselves are listed in the table below.<br />
<br />
[[File:Table1SkimRNN.png|center|1000px]]<br />
<br />
=== Classification Tasks ===<br />
<br />
In a language classification task, the input was a sequence of words and the output was the vector of categorical probabilities. Each word is embedded into a <math>d</math>-dimensional vector. We initialize the vector with GloVe [4] to form representations of the words and use those as the inputs for a long short-term memory (LSTM) architecture. A linear transformation on the last hidden state of the LSTM and then a softmax function was applied to obtain the classification probabilities. Adam [5] was used for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, <math>\tau = \max(0.5, exp(−rn))</math> where <math>r = 1e-4</math> and <math>n</math> is the global training step, following [2]. We experiment on different sizes of big LSTM (<math>d \in \{100, 200\}</math>) and small LSTM (<math>d' \in \{5, 10, 20\}</math>) and the ratio between the model loss and the skim loss (<math>\gamma\in \{0.01, 0.02\}</math>) for Skim-LSTM. The batch sizes used were 32 for SST and Rotten Tomatoes, and 128 for others. For all models, early stopping was used when the validation accuracy did not increase for 3000 global steps.<br />
<br />
==== Results ====<br />
<br />
[[File:Table2SkimRNN.png|center|1000px]]<br />
<br />
[[File:Figure2SkimRNN.png|center|1000px]]<br />
<br />
Table 2 shows the accuracy and computational cost of the Skim-RNN model compared with other standard models. It is evident that the Skim-RNN model produces a speed-up on the computational complexity of the task while maintaining a high degree of accuracy. Also, it is interesting to know that the accuracy improvement over LSTM could be due to the increased stability of the hidden state, as the majority of the hidden state is not updated when skimming. Figure 2 meanwhile demonstrates the effect of varying the size of the small hidden state as well as the parameter <math>\gamma</math> on the accuracy and computational cost.<br />
<br />
[[File:Table3SkimRNN.png|center|1000px]]<br />
<br />
Table 3 shows an example of a classification task over a IMDb dataset, where Skim-RNN with <math>d = 200</math>, <math>d' = 10</math>, and <math>\gamma = 0.01</math> correctly classifies it with high skimming rate (92%). The goal was to classify the review as either positive or negative. The black words are skimmed, and the blue words are fully read. The skimmed words are clearly irrelevant and the model learns to only carefully read the important words, such as ‘liked’, ‘dreadful’, and ‘tiresome’.<br />
<br />
=== Question Answering Task ===<br />
<br />
In Stanford Question Answering Dataset, the task was to locate the answer span for a given question in a context paragraph. The effectiveness of Skim-RNN for SQuAD was evaluated using two different models: LSTM+Attention and BiDAF [6]. The first model was inspired by most then-present QA systems consisting of multiple LSTM layers and an attention mechanism. This type of model is complex enough to reach reasonable accuracy on the dataset and simple enough to run well-controlled analyses for the Skim-RNN. The second model was an open-source model designed for SQuAD, used primarily to show that Skim-RNN could replace RNN in existing complex systems.<br />
<br />
==== Training ==== <br />
<br />
Adam was used with an initial learning rate of 0.0005. For stable training, the model was pretrained with a standard LSTM for the first 5k steps, and then fine-tuned with Skim-LSTM.<br />
<br />
==== Results ====<br />
<br />
[[File:Table4SkimRNN.png|center|1000px]]<br />
<br />
Table 4 shows the accuracy (F1 and EM) of LSTM+Attention and Skim-LSTM+Attention models as well as VCRNN [7]. It can be observed from the table that the skimming models achieve higher or similar accuracy scores compared to the non-skimming models while also reducing the computational cost by more than 1.4 times. In addition, decreasing layers (1 layer) or hidden size (<math>d=5</math>) improved the computational cost but significantly decreases the accuracy compared to skimming. The table also shows that replacing LSTM with Skim-LSTM in an existing complex model (BiDAF) stably gives reduced computational cost without losing much accuracy (only 0.2% drop from 77.3% of BiDAF to 77.1% of Sk-BiDAF with <math>\gamma = 0.001</math>).<br />
<br />
An explanation for this trend that was given is that the model is more confident about which tokens are important in the second layer. Second, higher <math>\gamma</math> values lead to a higher skimming rate, which agrees with its intended functionality.<br />
<br />
Figure 4 shows the F1 score of LSTM+Attention model using standard LSTM and Skim LSTM, sorted in ascending order by Flop-R (computational cost). While models tend to perform better with larger computational cost, Skim LSTM (Red) outperforms standard LSTM (Blue) with a comparable computational cost. It can also be seen that the computational cost of Skim-LSTM is more stable across different configurations and computational cost. Moreover, increasing the value of <math>\gamma</math> for Skim-LSTM gradually increases the skipping rate and Flop-R, while it also led to reduced accuracy.<br />
<br />
=== Runtime Benchmark ===<br />
<br />
[[File:Figure6SkimRNN.png|center|1000px]]<br />
<br />
The details of the runtime benchmarks for LSTM and Skim-LSTM, which are used to estimate the speedup of Skim-LSTM-based models in the experiments, are also discussed. A CPU-based benchmark was assumed to be the default benchmark, which has a direct correlation with the number of float operations that can be performed per second. As mentioned previously, the speed-up results in Table 2 (as well as Figure 7) are benchmarked using Python (NumPy), instead of popular frameworks such as TensorFlow or PyTorch.<br />
<br />
Figure 7 shows the relative speed gain of Skim-LSTM compared to standard LSTM with varying hidden state size and skim rate. NumPy was used, with the inferences run on a single thread of CPU. The ratio between the reduction of the number of float operations (Flop-R) of LSTM and Skim-LSTM was plotted, with the ratio acting as a theoretical upper bound of the speed gain on CPUs. From here, it can be noticed that there is a gap between the actual gain and the theoretical gain in speed, with the gap being larger with more overhead of the framework or more parallelization. The gap also decreases as the hidden state size increases because the overhead becomes negligible with very large matrix operations. This indicates that Skim-RNN provides greater benefits for RNNs with larger hidden state size. However, combining Skim-RNN with a CPU-based framework can lead to substantially lower latency than GPUs.<br />
<br />
== Results ==<br />
<br />
The results clearly indicate that the Skim-RNN model provides features that are suitable for general reading tasks, which include classification and question answering. While the tables indicate that minor losses in accuracy occasionally did result when parameters were set at specific values, they were minor and were acceptable given the improvement in runtime.<br />
<br />
An important advantage of Skim-RNN is that the skim rate (and thus computational cost) can be dynamically controlled at inference time by adjusting the threshold for<br />
‘skim’ decision probability <math>{\bf p}^1_t</math>. Figure 5 shows the trade-off between the accuracy and computational cost for two settings, confirming the importance of skimming (<math>d' > 0</math>) compared to skipping (<math>d' = 0</math>).<br />
<br />
Figure 6 shows that the model does not skim when the input seems to be relevant to answering the question, which was as expected by the design of the model. In addition, the LSTM in the second layer skims more than that in the first layer mainly because the second layer is more confident about the importance of each token.<br />
<br />
== Conclusion ==<br />
<br />
A Skim-RNN can offer better latency results on a CPU compared to a standard RNN on a GPU, with lower computational cost, as demonstrated through the results of this study. Future work (as stated by the authors) involves using Skim-RNN for applications that require much higher hidden state size, such as video understanding, and using multiple small RNN cells for varying degrees of skimming. Further, since it has the same input and output interface as a regular RNN it can replace RNNs in existing applications.<br />
<br />
== Critiques ==<br />
<br />
1. It seems like Skim-RNN is using the not full RNN of processing words that are not important, thus it can increase speed in some very particular circumstances (ie, only small networks). The extra model complexity did slow down the speed while trying to "optimizing" the efficiency and sacrifice part of accuracy while doing so. It is only trying to target a very specific situation (classification/question-answering) and made comparisons only with the baseline LSTM model. It would be definitely more persuasive if the model can compare with some of the state of art neural network models.<br />
<br />
2. This model of Skim-RNN is pretty good to extract binary classification type of text, thus it would be interesting for this to be applied to stock market news analyzing. For example a press release from a company can be analyzed quickly using this model and immediately give the trader a positive or negative summary of the news. Would be beneficial in trading since time and speed is an important factor when executing a trade.<br />
<br />
3. An appropriate application for Skim-RNN could be customer service chat bots as they can analyze a customer's message and skim associated company policies to craft a response. In this circumstance, quickly analyzing text is ideal to not waste customers time.<br />
<br />
4. This could be applied to news apps to improve readability by highlighting important sections.<br />
<br />
5. This summary describes an interesting and useful model which can save readers time for reading an article. I think it will be interesting that discuss more on training a model by Skim-RNN to highlight the important sections in very long textbooks. As a student, having highlights in the textbook is really helpful to study. But highlight the important parts in a time-consuming work for the author, maybe using Skim-RNN can provide a nice model to do this job. <br />
<br />
6. Besides the good training performance of Skim-RNN, it's good to see the algorithm even performs well simply by training with CPU. It would make it possible to perform the result on lite-platforms.<br />
<br />
== Applications ==<br />
<br />
Recurrent architectures are used in many other applications, such as for processing video. Real-time video processing is an exceedingly demanding and resource-constrained task, particularly in edge settings. It would be interesting to see if this method could be applied to those cases for more efficient inference, such as on drones or self-driving cars.<br />
<br />
== References ==<br />
<br />
[1] Patricia Anderson Carpenter Marcel Adam Just. The Psychology of Reading and Language Comprehension. 1987.<br />
<br />
[2] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.<br />
<br />
[3] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.<br />
<br />
[4] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.<br />
<br />
[5] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
<br />
[6] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017a.<br />
<br />
[7] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.<br />
<br />
[8] Adams Wei Yu, Hongrae Lee, and Quoc V Le. Learning to skim text. In ACL, 2017.<br />
<br />
[9] Eunsol Choi, Daniel Hewlett, Alexandre Lacoste, Illia Polosukhin, Jakob Uszkoreit, and Jonathan Berant. Coarse-to-fine question answering for long documents. In ACL, 2017.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Point-of-Interest_Recommendation:_Exploiting_Self-Attentive_Autoencoders_with_Neighbor-Aware_Influence&diff=47353Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence2020-11-28T19:58:35Z<p>T34sun: </p>
<hr />
<div>== Presented by == <br />
Guanting(Tony) Pan, Zaiwei(Zara) Zhang, Haocheng(Beren) Chang<br />
<br />
== Introduction == <br />
With the development of mobile devices and location-acquisition technologies, accessing real-time location information is being easier and more efficient. Precisely because of this development, Location-based Social Networks (LBSNs) has become an important part of human’s life. People can share their experiences in a location, such as restaurants and parks, on the Internet. These locations can be seen as a Point-of-Interest (POI) in software such as Maps on our phone. These large amounts of user-POI interaction data can provide a service, which is called personalized POI recommendation, to give recommendations to users that the location they might be interested in. These large amounts of data can be used to train a model through Machine Learning methods(i.e. Classification, Clustering, etc.) to predict a POI that users might be interested in. The POI recommendation system still faces some challenging issues: (1) the difficulty of modeling complex user-POI interactions from sparse implicit feedback; (2) it is difficult to incorporate geographic background information. In order to meet these challenges, this paper will introduce a novel autoencoder-based model to learn non-linear user-POI relations, which is called SAE-NAD. SAE stands for self-attentive encoder while NAD stands for the neighbor-aware decoder. Autoencoder is an unsupervised learning technique that we implement in neural network model for representation learning, meaning that our neural network will contain a "bottleneck" layer that produce a compressed knowledge representation of the original input. This method will include machine learning knowledge that we learned in this course.<br />
<br />
== Previous Work == <br />
<br />
In the previous works, the method is just equally treating users checked in POIs. The drawback of equally treating users checked in POIs is that valuable information about the similarity between users is not utilized, thus reducing the power of such recommenders. However, the SAE adaptively differentiates the user preference degrees in multiple aspects.<br />
<br />
There are some other personalized POI recommendation methods that can be used. Some famous software (e.g. Netflix) uses model-based methods that are built on matrix factorization (MF). For example, ranked based Geographical Factorization Method in [1] adopted weighted regularized MF to serve people on POI. Machine learning is popular in this area. POI recommendation is an important topic in the domain of recommender systems [4]. This paper also described related work in Personalized location recommendation and attention mechanism in the recommendation.<br />
<br />
== Motivation == <br />
This paper reviews encoder and decoder. A single hidden-layer autoencoder is an unsupervised neural network, which is consist of two parts: an encoder and a decoder. The encoder has one activation function that maps the input data to the latent space. The decoder also has one activation function mapping the representations in the latent space to the reconstruction space. And here is the formula:<br />
<br />
[[File: formula.png|center]](Note: a is the activation function)<br />
<br />
The proposed method uses a two-layer neural network to compute the score matrix in the architecture of the SAE. The NAD adopts the RBF kernel to make checked-in POIs exert more influence on nearby unvisited POIs. To train this model, Network training is required.<br />
<br />
This paper will use the datasets in the real world, which are from Gowalla[2], Foursquare [3], and Yelp[3]. These datasets would be used to train by using the method introduced in this paper and compare the performance of SAE-NAD with other POI recommendation methods. Three groups of methods are used to compare with the proposed method, which are traditional MF methods for implicit feedback, Classical POI recommendation methods, and Deep learning-based methods. Specifically, the Deep learning-based methods contain a DeepAE which is a three-hidden-layer autoencoder with a weighted loss function, we can connect this to the material in this course.<br />
<br />
== Methodology == <br />
<br />
=== Notations ===<br />
<br />
Here are the notations used in this paper. It will be helpful when trying to understand the structure and equations in the algorithm.<br />
[[File:notations.JPG|500px|x300px|center]]<br />
<br />
=== Structure ===<br />
<br />
The structure of the network in this paper includes a self-attentive encoder as the input layer(yellow), and a neighbor-aware decoder as the output layer(green).<br />
<br />
[[File:1.JPG|1200px|x600px]]<br />
<br />
=== Self-Attentive Encoder ===<br />
<br />
The self-attentive encoder is the input layer. It transfers the preference vector x_u to hidden representation A_u using weight matrix W^1 and the activation function softmax and tanh. The 0's and 1's in x_u indicates whether the user has been to a certain POI. The weight matrix W_a assigns different weights on various features of POIs.<br />
<br />
[[File:encoder.JPG|center]]<br />
<br />
=== Neighbor-Aware Decoder ===<br />
<br />
POI recommendation uses the geographical clustering phenomenon, which increases the weight of the unvisited POIs that surrounds the visited POIs. Also, an aggregation layer is added to the network to aggregate users’ representations from different aspects into one aspect. This means that a person who has visited a location are very likely to return to this location again in the future, so the user is recommended POIs surrounding this area. An example would be someone who has been to the UW plaza and bought Lazeez are very likely to return to the plaza, therefore the person is recommended to try Mr. Panino's Beijing House.<br />
<br />
[[File:decoder.JPG|center]]<br />
<br />
=== Objective Function ===<br />
<br />
By minimizing the objective function, the partial derivatives with respect to all the parameters can be computed by gradient descent with backpropagation. After that, the training is complete.<br />
<br />
[[File:objective_function.JPG|center]]<br />
<br />
<br />
== Comparative analysis ==<br />
<br />
=== Metrics introduction ===<br />
To obtain a comprehensive evaluation of the effectiveness of the model, the authors performed a thorough comparison between the proposed model and the existing major POI recommendation methods. These methods can be further broken down into three categories: traditional matrix factorization methods for implicit feedback, classical POI recommendation methods, and deep learning-based methods. Here, three key evaluation metrics were introduced as Precison@k, Recall@k, and MAP@k. Through comparing all models on three datasets using the above metrics, it is concluded that the proposed model achieved the best performance.<br />
<br />
To better understand the comparison results, it is critical for one to understand the meanings behind each evaluation metrics. Suppose the proposed model generated k recommended POIs for the user. The first metric, Precison@k, measures the percentage of the recommended POIs which the user has visited. Recall@k is also associated with the user’s behavior. However, it will measure the percentage of recommended POIs in all POIs which have been visited by the user. Lastly, MAP@k represents the mean average precision at k, where average precision is the average of precision values at all k ranks, where relevant POIs are found.<br />
<br />
=== Model Comparison ===<br />
Among all models in the comparison group, RankGeoFM, IRNMF, and PACE produced the best results. Nonetheless, these models are still incomparable to our proposed model. The reasons are explained in details as follows:<br />
<br />
Both RankGeoFM and IRNMF incorporate geographical influence into their ranking models, which is significant for generating POI recommendations. However, they are not capable of capturing non-linear interactions between users and POIs. In comparison, the proposed model, while incorporating geographical influence, adopts a deep neural structure which enables it to measure non-linear and complex interactions. As a result, it outperforms the two methods in the comparison group.<br />
<br />
Moreover, compared to PACE, which is a deep learning-based method, the proposed model offers a more precise measurement of geographical influence. Though PACE is able to capture complex interactions, it models the geographical influence by a context graph, which fails to incorporate user reachability into the modeling process. In contrast, the proposed model is able to capture geographical influence directly through its neighbor-aware decoder, which allows it to achieve better performance than the PACE model.<br />
<br />
[[File:model_comparison.JPG|center]]<br />
<br />
== Conclusion ==<br />
In summary, the proposed model, namely SAE-NAD, clearly showed its advantages compared to many state-of-the-art baseline methods. Its self-attentive encoder effectively discriminates user preferences on check-in POIs, and its neighbor-aware decoder measures geographical influence precisely through differentiating user reachability on unvisited POIs. By leveraging these two components together, it is able to generate recommendations that are highly relevant to its users.<br />
<br />
== Critiques ==<br />
Besides developing the model and conducting a detailed analysis, the authors also did very well in constructing this paper. The paper is well-written and has a highly logical structure. Definitions, notations, and metrics are introduced and explained clearly, which enables readers to follow through the analysis easily. Last but not least, both the abstract and the conclusion of this paper are strong. The abstract concisely reported the objectives and outcomes of the experiment, whereas the conclusion is succinct and precise.<br />
<br />
This idea would have many applications such as in food delivery service apps to suggest new restaurants to customers.<br />
<br />
It would be nice if the author could show the comparison result in tables vs other methodologies. Both in terms of accuracy and time-efficiency. In addition, the drawbacks of this new methodology are unknown to the readers.<br />
<br />
It would also be nice if the authors provided some more ablation on the various components of the proposed method. Even after reading some of their experiments, I do not have a clear understanding of how important each component is to the recommendation quality.<br />
<br />
== References ==<br />
[1] Defu Lian, Cong Zhao, Xing Xie, Guangzhong Sun, Enhong Chen, and Yong Rui. 2014. GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation. In KDD. ACM, 831–840.<br />
<br />
[2] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. 2011. Friendship and mobility: user movement in location-based social networks. In KDD. ACM, 1082–1090.<br />
<br />
[3] Yiding Liu, Tuan-Anh Pham, Gao Cong, and Quan Yuan. 2017. An Experimental Evaluation of Point-of-interest Recommendation in Location-based Social Networks. PVLDB 10, 10 (2017), 1010–1021.<br />
<br />
[4] Jie Bao, Yu Zheng, David Wilkie, and Mohamed F. Mokbel. 2015. Recommendations in location-based social networks: a survey. GeoInformatica 19, 3 (2015), 525–565.<br />
<br />
[5] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW. ACM, 173–182.<br />
<br />
[6] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM. IEEE Computer Society, 263–272.<br />
<br />
[7] Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: factored item similarity models for top-N recommender systems. In KDD. ACM, 659–667. [12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).<br />
<br />
[8] Yong Liu,WeiWei, Aixin Sun, and Chunyan Miao. 2014. Exploiting Geographical<br />
Neighborhood Characteristics for Location Recommendation. In CIKM. ACM,<br />
739–748<br />
<br />
[9] Xutao Li, Gao Cong, Xiaoli Li, Tuan-Anh Nguyen Pham, and Shonali Krishnaswamy.<br />
2015. Rank-GeoFM: A Ranking based Geographical Factorization<br />
Method for Point of Interest Recommendation. In SIGIR. ACM, 433–442.<br />
<br />
[10] Carl Yang, Lanxiao Bai, Chao Zhang, Quan Yuan, and Jiawei Han. 2017. Bridging<br />
Collaborative Filtering and Semi-Supervised Learning: A Neural Approach for<br />
POI Recommendation. In KDD. ACM, 1245–1254.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_Networks&diff=46458Graph Structure of Neural Networks2020-11-25T18:02:59Z<p>T34sun: /* Introduction */</p>
<hr />
<div>= Presented By =<br />
<br />
Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang<br />
<br />
= Introduction =<br />
<br />
A deep neural network is composed of neurons organized into layers and the connections between them. The architecture of a neural network can be captured by its "computational graph", where neurons are represented as nodes, and directed edges link neurons in different layers. This graphical representation demonstrates how the network transmits and transforms information through its input neurons through the hidden layer and all the way to the output neurons.<br />
<br />
In Neural Networks research, it is often important to build a relation between a neural network’s accuracy and its underlying graph structure. A natural choice is to use computational graph representation, but this has many limitations including a lack of generality and disconnection with biology/neuroscience. This disconnection between biology/neuroscience makes knowledge less transferable and interdisciplinary research more difficult.<br />
<br />
Thus, we developed a new way of representing a neural network as a graph, called a relational graph. The key insight is to focus on message exchange, rather than just on directed data flow. For example, for a fixed-width fully-connected layer, an input channel and output channel pair can be represented as a single node, while an edge in the relational graph can represent the message exchange between the two nodes. Under this formulation, using the appropriate message exchange definition, it can be shown that the relational graph can represent many types of neural network layers.<br />
<br />
WS-flex is a graph generator that allows for the systematic exploration of the design space of neural networks. Neural networks are characterized by the clustering coefficient and average path length of their relational graphs under the insights of neuroscience.<br />
<br />
= Neural Network as Relational Graph =<br />
<br />
The author proposes the concept of relational graph to study the graphical structure of neural network. Each relational graph is based on an undirected graph <math>G =(V; E)</math>, where <math>V =\{v_1,...,v_n\}</math> is the set of all the nodes, and <math>E \subseteq \{(v_i,v_j)|v_i,v_j\in V\}</math> is the set of all edges that connect nodes. Note that for the graph used here, all nodes have self edges, that is <math>(v_i,v_i)\in E</math>. <br />
<br />
To build a relational graph that captures the message exchange between neurons in the network, we associate various mathematical quantities to the graph <math>G</math>. First, a feature quantity <math>x_v</math> is associated with each node. The quantity <math>x_v</math> might be a scalar, vector or tensor depending on different types of neural networks (see the Table at the end of the section). Then a message function <math>f_{uv}(·)</math> is associated with every edge in the graph. A message function specifically takes a node’s feature as the input and then output a message. An aggregation function <math>{\rm AGG}_v(·)</math> then takes a set of messages (the outputs of message function) and outputs the updated node feature. <br />
<br />
A relation graph is a graph <math>G</math> associated with several rounds of message exchange, which transform the feature quantity <math>x_v</math> with the message function <math>f_{uv}(·)</math> and the aggregation function <math>{\rm AGG}_v(·)</math>. At each round of message exchange, each node sends messages to its neighbors and aggregates incoming messages from its neighbors. Each message is transformed at each edge through the message function, then they are aggregated at each node via the aggregation function. Suppose we have already conducted <math>r-1</math> rounds of message exchange, then the <math>r^{th}</math> round of message exchange for a node <math>v</math> can be described as<br />
<br />
<div style="text-align:center;"><math>\mathbf{x}_v^{(r+1)}= {\rm AGG}^{(r)}(\{f_v^{(r)}(\textbf{x}_u^{(r)}), \forall u\in N(v)\})</math></div> <br />
<br />
where <math>\mathbf{x}^{(r+1)}</math> is the feature of of the <math>v</math> node in the relational graph after the <math>r^{th}</math> round of update. <math>u,v</math> are nodes in Graph <math>G</math>. <math>N(u)=\{u|(u,v)\in E\}</math> is the set of all the neighbor nodes of <math>u</math> in graph <math>G</math>.<br />
<br />
To further illustrate the above, we use the basic Multilayer Perceptron (MLP) as an example. An MLP consists of layers of neurons, where each neuron performs a weighted sum over scalar inputs and outputs, followed by some non-linearity. Suppose the <math>r^{th}</math> layer of an MLP takes <math>x^{(r)}</math> as input and <math>x^{(r+1)}</math> as output, then a neuron computes <br />
<br />
<div style="text-align:center;"><math>x_i^{(r+1)}= \sigma(\Sigma_jw_{ij}^{(r)}x_j^{(r)})</math>.</div> <br />
<br />
where <math>w_{ij}^{(r)}</math> is the trainable weight and <math>\sigma</math> is the non-linearity function. Let's first consider the special case where the input and output of all the layers <math>x^{(r)}</math>, <math>1 \leq r \leq R </math> have the same feature dimensions <math>d</math>. In this scenario, we can have <math>d</math> nodes in the Graph <math>G</math> with each node representing a neuron in MLP. Each layer of neural network will correspond with a round of message exchange, so there will be <math>R</math> rounds of message exchange in total. The aggregation function here will be the summation with non-linearity transform <math>\sigma(\Sigma)</math>, while the message function is simply the scalar multipication with weight. A fully-connected, fixed-width MLP layer can then be expressed with a complete relational graph, where each node <math>x_v</math> connects to all the other nodes in <math>G</math>, that is neighborhood set <math>N(v) = V</math> for each node <math>v</math>. The figure below shows the correspondence between the complete relation graph with a 5-layer 4-dimension fully-connected MLP.<br />
<br />
<div style="text-align:center;">[[File:fully_connnected_MLP.png]]</div><br />
<br />
In fact, a fixed-width fully-connected MLP is only a special case under a much more general model family, where the message function, aggregation function, and most importantly, the relation graph structure can vary. The different relational graph will represent the different topological structure and information exchange pattern of the network, which is the property that the paper wants to examine. The plot below shows two examples of non-fully connected fixed-width MLP and their corresponding relational graphs. <br />
<br />
<div style="text-align:center;">[[File:otherMLP.png]]</div><br />
<br />
We can generalize the above definitions for fixed-width MLP to Variable-width MLP, Convolutional Neural Network (CNN), and other modern network architecture like Resnet by allowing the node feature quantity <math>\textbf{x}_j^{(r)}</math> to be a vector or tensor respectively. In this case, each node in the relational graph will represent multiple neurons in the network, and the number of neurons contained in each node at each round of message change does not need to be the same, which gives us a flexible representation of different neural network architecture. The message function will then change from the simple scalar multiplication to either matrix/tensor multiplication or convolution. The representation of these more complicated networks are described in detail in the paper, and the correspondence between different networks and their relational graph properties is summarized in the table below. <br />
<br />
<div style="text-align:center;">[[File:relational_specification.png]]</div><br />
<br />
Overall, relational graphs provide a general representation for neural networks. With proper definitions of node features and message exchange, relational graphs can represent diverse neural architectures, thereby allowing us to study the performance of different graph structures.<br />
<br />
= Exploring and Generating Relational Graphs=<br />
<br />
We will deal with the design and how to explore the space of relational graphs in this section. There are three parts we need to consider:<br />
<br />
(1) '''Graph measures''' that characterize graph structural properties:<br />
<br />
We will use one global graph measure, average path length, and one local graph measure, clustering coefficient in this paper.<br />
To explain clearly, average path length measures the average shortest path distance between any pair of nodes; the clustering coefficient measures the proportion of edges between the nodes within a given node’s neighborhood, divided by the number of edges that could possibly exist between them, averaged over all the nodes.<br />
<br />
(2) '''Graph generators''' that can generate the diverse graph:<br />
<br />
With selected graph measures, we use a graph generator to generate diverse graphs to cover a large span of graph measures. To figure out the limitation of the graph generator and find out the best, we investigate some generators including ER, WS, BA, Harary, Ring, Complete graph and results shows as below:<br />
<br />
<div style="text-align:center;">[[File:3.2 graph generator.png]]</div><br />
<br />
Thus, from the picture, we could obtain the WS-flex graph generator that can generate graphs with a wide coverage of graph measures; notably, WS-flex graphs almost encompass all the graphs generated by classic random generators mentioned above.<br />
<br />
(3) '''Computational Budget''' that we need to control so that the differences in performance of different neural networks are due to their diverse relational graph structures.<br />
<br />
It is important to ensure that all networks have approximately the same complexities so that the differences in performance are due to their relational graph structures when comparing neutral work by their diverse graph.<br />
<br />
We use FLOPS (# of multiply-adds) as the metric. We first compute the FLOPS of our baseline network instantiations (i.e. complete relational graph) and use them as the reference complexity in each experiment. From the description in section 2, a relational graph structure can be instantiated as a neural network with variable width. Therefore, we can adjust the width of a neural network to match the reference complexity without changing the relational graph structures.<br />
<br />
= Experimental Setup =<br />
The author studied the performance of 3942 sampled relational graphs (generated by WS-flex from the last section) of 64 nodes with two experiments: <br />
<br />
(1) CIFAR-10 dataset: 10 classes, 50K training images, and 10K validation images<br />
<br />
Relational Graph: all 3942 sampled relational graphs of 64 nodes<br />
<br />
Studied Network: 5-layer MLP with 512 hidden units<br />
<br />
<br />
(2) ImageNet classification: 1K image classes, 1.28M training images and 50K validation images<br />
<br />
Relational Graph: Due to high computational cost, 52 graphs are uniformly sampled from the 3942 available graphs.<br />
<br />
Studied Network: <br />
*ResNet-34, which only consists of basic blocks of 3×3 convolutions (He et al., 2016)<br />
<br />
*ResNet-34-sep, a variant where we replace all 3×3 dense convolutions in ResNet-34 with 3×3 separable convolutions (Chollet, 2017)<br />
<br />
*ResNet-50, which consists of bottleneck blocks (He et al., 2016) of 1×1, 3×3, 1×1 convolutions<br />
<br />
*EfficientNet-B0 architecture (Tan & Le, 2019)<br />
<br />
*8-layer CNN with 3×3 convolution<br />
<br />
= Discussions and Conclusions =<br />
<br />
The paper summarizes the result of the experiment among multiple different relational graphs through sampling and analyzing and list six important observations during the experiments, These are:<br />
<br />
* There are always exists graph structure that has higher predictive accuracy under Top-1 error compare to the complete graph<br />
<br />
* There is a sweet spot that the graph structure near the sweet spot usually outperform the base graph<br />
<br />
* The predictive accuracy under top-1 error can be represented by a smooth function of Average Path Length <math> (L) </math> and Clustering Coefficient <math> (C) </math><br />
<br />
* The Experiments is consistent across multiple datasets and multiple graph structure with similar Average Path Length and Clustering Coefficient.<br />
<br />
* The best graph structure can be identified easily.<br />
<br />
* There is a similarity between best artificial neurons and biological neurons.<br />
<br />
----<br />
<br />
<br />
<br />
[[File:Result2_441_2020Group16.png]]<br />
<br />
$$\text{Figure - Results from Experiments}$$<br />
<br />
== Neural networks performance depends on its structure ==<br />
During the experiment, Top-1 errors for all sampled relational graph among multiple tasks and graph structures are recorded. The parameters of the models are average path length and clustering coefficient. Heat maps were created to illustrate the difference in predictive performance among possible average path length and clustering coefficient. In '''Figure - Results from Experiments (a)(c)(f)''', The darker area represents a smaller top-1 error which indicates the model performs better than the light area.<br />
<br />
Compare with the complete graph which has parameter <math> L = 1 </math> and <math> C = 1 </math>, The best performing relational graph can outperform the complete graph baseline by 1.4% top-1 error for MLP on CIFAR-10, and 0.5% to 1.2% for models on ImageNet. Hence it is an indicator that the predictive performance of the neural networks highly depends on the graph structure, or equivalently that the completed graph does not always have the best performance. <br />
<br />
== Sweet spot where performance is significantly improved ==<br />
It had been recognized that training noises often results in inconsistent predictive results. In the paper, the 3942 graphs in the sample had been grouped into 52 bins, each bin had been colored based on the average performance of graphs that fall into the bin. By taking the average, the training noises had been significantly reduced. Based on the heat map '''Figure - Results from Experiments (f)''', the well-performing graphs tend to cluster into a special spot that the paper called “sweet spot” shown in the red rectangle, the rectangle is approximately included clustering coefficient in the range <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math>.<br />
<br />
== Relationship between neural network’s performance and parameters == <br />
When we visualize the heat map, we can see that there is no significant jump of performance that occurred as a small change of clustering coefficient and average path length ('''Figure - Results from Experiments (a)(c)(f)'''). In addition, if one of the variables is fixed in a small range, it is observed that a second-degree polynomial is a good visualization tool for the overall trend ('''Figure - Results from Experiments (b)(d)'''). Therefore, both the clustering coefficient and average path length are highly related to neural network performance by a U-shape. <br />
<br />
== Consistency among many different tasks and datasets ==<br />
They observe that relational graphs with certain graph measures may consistently perform well regardless of how they are instantiated. The paper presents consistency uses two perspectives, one is qualitative consistency and another one is quantitative consistency.<br />
<br />
(1) '''Qualitative Consistency'''<br />
It is observed that the results are consistent from different points of view. Among multiple architecture dataset, it is observed that the clustering coefficient within <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math> consistently outperform the baseline complete graph. <br />
<br />
(2) '''Quantitative Consistency'''<br />
Among different dataset with the network that has similar clustering coefficient and average path length, the results are correlated, The paper mentioned that ResNet-34 is much more complex than 5-layer MLP but a fixed set relational graph would perform similarly in both settings, with Pearson correlation of <math>0.658</math>, the p-value for the Null hypothesis is less than <math>10^{-8}</math>.<br />
<br />
== Top architectures can be identified efficiently ==<br />
The computation cost of finding top architectures can be significantly reduced without training the entire data set for a large value of epoch or a relatively large sample. To achieve the top architectures, the number of graphs and training epochs need to be identified. For the number of graphs, a heatmap is a great tool to demonstrate the result. In the 5-layer MLP on CIFAR-10 example, taking a sample of the data around 52 graphs would have a correlation of 0.9, which indicates that fewer samples are needed for a similar analysis in practice. When determining the number of epochs, correlation can help to show the result. In ResNet34 on ImageNet example, the correlation between the variables is already high enough for future computation within 3 epochs. This means that good relational graphs perform well even at the<br />
initial training epochs.<br />
<br />
== Well-performing neural networks have graph structure surprisingly similar to those of real biological neural networks==<br />
The way we define relational graphs and average length in the graph is similar to the way information is exchanged in network science. The biological neural network also has a similar relational graph representation and graph measure with the best-performing relational graph.<br />
<br />
While there is some organizational similarity between a computational neural network and a biological neural network, we should refrain from saying that both these networks share many similarities or are essentially the same with just different substrates. The biological neurons are still quite poorly understood and it may take a while before their mechanisms are better understood.<br />
<br />
= Critique =<br />
<br />
1. The experiment is only measuring on a single data set which might not be representative enough. As we can see in the whole paper, the "sweet spot" we talked about might be a special feature for the given data set only which is the CIFAR-10 data set. If we change the data set to another imaging data set like CK+, whether we are going to get a similar result is not shown by the paper. Hence, the result that is being concluded from the paper might not be representative enough. <br />
<br />
2. When we are fitting the model in practice, we will fit the model with more than one epoch. The order of the model fitting should be randomized since we should create more random jumps to avoid staked inside a local minimum. With the same order within each epoch, the data might be grouped by different classes or levels, the model might result in a better performance with certain classes and worse performance with other classes. In this particular example, without randomization of the training data, the conclusion might not be precise enough.<br />
<br />
3. This study shows empirical justification for choosing well-performing models from graphs differing only by average path length and clustering coefficient. An equally important question is whether there is the theoretical justification for why these graph properties may (or may not) contribute the performance of a general classifier - for example, is there a combination of these properties that is sufficient to recover the universality theorem for MLP's?<br />
<br />
4. It might be worth looking into how to identify the "sweet spot" for different datasets.<br />
<br />
5. What would be considered a "best graph structure " in the discussion and conclusion part? It seems that the intermediate result of getting an accurate result was by binning graphs into smaller bins, what should we do if the graphs can not be binned into significantly smaller bins in order to proceed with the methodologies mentioned in the paper. Both CIFAR - 10 and ImageNet seem too trivial considering the amount of variation and categories in the dataset. What would the generalizability be to other presentation of images?</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_Networks&diff=46457Graph Structure of Neural Networks2020-11-25T18:02:45Z<p>T34sun: /* Introduction */</p>
<hr />
<div>= Presented By =<br />
<br />
Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang<br />
<br />
= Introduction =<br />
<br />
A deep neural network is composed of neurons organized into layers and the connections between them. The architecture of a neural network can be captured by its "computational graph", where neurons are represented as nodes, and directed edges link neurons in different layers. This graphical representation demonstrates how the network transmits and transforms information through its input neurons through the hidden layer and all the way to the output neurons.<br />
<br />
In Neural Networks research, it is often important to build a relation between a neural network’s accuracy and its underlying graph structure. A natural choice is to use computational graph representation, but this has many limitations including a lack of generality and disconnection with biology/neuroscience. This disconnection between biology/neuroscience makes knowledge less transferable and interdisciplinary research more difficult.<br />
<br />
Thus, we developped a new way of representing a neural network as a graph, called a relational graph. The key insight is to focus on message exchange, rather than just on directed data flow. For example, for a fixed-width fully-connected layer, an input channel and output channel pair can be represented as a single node, while an edge in the relational graph can represent the message exchange between the two nodes. Under this formulation, using the appropriate message exchange definition, it can be shown that the relational graph can represent many types of neural network layers.<br />
<br />
WS-flex is a graph generator that allows for the systematic exploration of the design space of neural networks. Neural networks are characterized by the clustering coefficient and average path length of their relational graphs under the insights of neuroscience.<br />
<br />
= Neural Network as Relational Graph =<br />
<br />
The author proposes the concept of relational graph to study the graphical structure of neural network. Each relational graph is based on an undirected graph <math>G =(V; E)</math>, where <math>V =\{v_1,...,v_n\}</math> is the set of all the nodes, and <math>E \subseteq \{(v_i,v_j)|v_i,v_j\in V\}</math> is the set of all edges that connect nodes. Note that for the graph used here, all nodes have self edges, that is <math>(v_i,v_i)\in E</math>. <br />
<br />
To build a relational graph that captures the message exchange between neurons in the network, we associate various mathematical quantities to the graph <math>G</math>. First, a feature quantity <math>x_v</math> is associated with each node. The quantity <math>x_v</math> might be a scalar, vector or tensor depending on different types of neural networks (see the Table at the end of the section). Then a message function <math>f_{uv}(·)</math> is associated with every edge in the graph. A message function specifically takes a node’s feature as the input and then output a message. An aggregation function <math>{\rm AGG}_v(·)</math> then takes a set of messages (the outputs of message function) and outputs the updated node feature. <br />
<br />
A relation graph is a graph <math>G</math> associated with several rounds of message exchange, which transform the feature quantity <math>x_v</math> with the message function <math>f_{uv}(·)</math> and the aggregation function <math>{\rm AGG}_v(·)</math>. At each round of message exchange, each node sends messages to its neighbors and aggregates incoming messages from its neighbors. Each message is transformed at each edge through the message function, then they are aggregated at each node via the aggregation function. Suppose we have already conducted <math>r-1</math> rounds of message exchange, then the <math>r^{th}</math> round of message exchange for a node <math>v</math> can be described as<br />
<br />
<div style="text-align:center;"><math>\mathbf{x}_v^{(r+1)}= {\rm AGG}^{(r)}(\{f_v^{(r)}(\textbf{x}_u^{(r)}), \forall u\in N(v)\})</math></div> <br />
<br />
where <math>\mathbf{x}^{(r+1)}</math> is the feature of of the <math>v</math> node in the relational graph after the <math>r^{th}</math> round of update. <math>u,v</math> are nodes in Graph <math>G</math>. <math>N(u)=\{u|(u,v)\in E\}</math> is the set of all the neighbor nodes of <math>u</math> in graph <math>G</math>.<br />
<br />
To further illustrate the above, we use the basic Multilayer Perceptron (MLP) as an example. An MLP consists of layers of neurons, where each neuron performs a weighted sum over scalar inputs and outputs, followed by some non-linearity. Suppose the <math>r^{th}</math> layer of an MLP takes <math>x^{(r)}</math> as input and <math>x^{(r+1)}</math> as output, then a neuron computes <br />
<br />
<div style="text-align:center;"><math>x_i^{(r+1)}= \sigma(\Sigma_jw_{ij}^{(r)}x_j^{(r)})</math>.</div> <br />
<br />
where <math>w_{ij}^{(r)}</math> is the trainable weight and <math>\sigma</math> is the non-linearity function. Let's first consider the special case where the input and output of all the layers <math>x^{(r)}</math>, <math>1 \leq r \leq R </math> have the same feature dimensions <math>d</math>. In this scenario, we can have <math>d</math> nodes in the Graph <math>G</math> with each node representing a neuron in MLP. Each layer of neural network will correspond with a round of message exchange, so there will be <math>R</math> rounds of message exchange in total. The aggregation function here will be the summation with non-linearity transform <math>\sigma(\Sigma)</math>, while the message function is simply the scalar multipication with weight. A fully-connected, fixed-width MLP layer can then be expressed with a complete relational graph, where each node <math>x_v</math> connects to all the other nodes in <math>G</math>, that is neighborhood set <math>N(v) = V</math> for each node <math>v</math>. The figure below shows the correspondence between the complete relation graph with a 5-layer 4-dimension fully-connected MLP.<br />
<br />
<div style="text-align:center;">[[File:fully_connnected_MLP.png]]</div><br />
<br />
In fact, a fixed-width fully-connected MLP is only a special case under a much more general model family, where the message function, aggregation function, and most importantly, the relation graph structure can vary. The different relational graph will represent the different topological structure and information exchange pattern of the network, which is the property that the paper wants to examine. The plot below shows two examples of non-fully connected fixed-width MLP and their corresponding relational graphs. <br />
<br />
<div style="text-align:center;">[[File:otherMLP.png]]</div><br />
<br />
We can generalize the above definitions for fixed-width MLP to Variable-width MLP, Convolutional Neural Network (CNN), and other modern network architecture like Resnet by allowing the node feature quantity <math>\textbf{x}_j^{(r)}</math> to be a vector or tensor respectively. In this case, each node in the relational graph will represent multiple neurons in the network, and the number of neurons contained in each node at each round of message change does not need to be the same, which gives us a flexible representation of different neural network architecture. The message function will then change from the simple scalar multiplication to either matrix/tensor multiplication or convolution. The representation of these more complicated networks are described in detail in the paper, and the correspondence between different networks and their relational graph properties is summarized in the table below. <br />
<br />
<div style="text-align:center;">[[File:relational_specification.png]]</div><br />
<br />
Overall, relational graphs provide a general representation for neural networks. With proper definitions of node features and message exchange, relational graphs can represent diverse neural architectures, thereby allowing us to study the performance of different graph structures.<br />
<br />
= Exploring and Generating Relational Graphs=<br />
<br />
We will deal with the design and how to explore the space of relational graphs in this section. There are three parts we need to consider:<br />
<br />
(1) '''Graph measures''' that characterize graph structural properties:<br />
<br />
We will use one global graph measure, average path length, and one local graph measure, clustering coefficient in this paper.<br />
To explain clearly, average path length measures the average shortest path distance between any pair of nodes; the clustering coefficient measures the proportion of edges between the nodes within a given node’s neighborhood, divided by the number of edges that could possibly exist between them, averaged over all the nodes.<br />
<br />
(2) '''Graph generators''' that can generate the diverse graph:<br />
<br />
With selected graph measures, we use a graph generator to generate diverse graphs to cover a large span of graph measures. To figure out the limitation of the graph generator and find out the best, we investigate some generators including ER, WS, BA, Harary, Ring, Complete graph and results shows as below:<br />
<br />
<div style="text-align:center;">[[File:3.2 graph generator.png]]</div><br />
<br />
Thus, from the picture, we could obtain the WS-flex graph generator that can generate graphs with a wide coverage of graph measures; notably, WS-flex graphs almost encompass all the graphs generated by classic random generators mentioned above.<br />
<br />
(3) '''Computational Budget''' that we need to control so that the differences in performance of different neural networks are due to their diverse relational graph structures.<br />
<br />
It is important to ensure that all networks have approximately the same complexities so that the differences in performance are due to their relational graph structures when comparing neutral work by their diverse graph.<br />
<br />
We use FLOPS (# of multiply-adds) as the metric. We first compute the FLOPS of our baseline network instantiations (i.e. complete relational graph) and use them as the reference complexity in each experiment. From the description in section 2, a relational graph structure can be instantiated as a neural network with variable width. Therefore, we can adjust the width of a neural network to match the reference complexity without changing the relational graph structures.<br />
<br />
= Experimental Setup =<br />
The author studied the performance of 3942 sampled relational graphs (generated by WS-flex from the last section) of 64 nodes with two experiments: <br />
<br />
(1) CIFAR-10 dataset: 10 classes, 50K training images, and 10K validation images<br />
<br />
Relational Graph: all 3942 sampled relational graphs of 64 nodes<br />
<br />
Studied Network: 5-layer MLP with 512 hidden units<br />
<br />
<br />
(2) ImageNet classification: 1K image classes, 1.28M training images and 50K validation images<br />
<br />
Relational Graph: Due to high computational cost, 52 graphs are uniformly sampled from the 3942 available graphs.<br />
<br />
Studied Network: <br />
*ResNet-34, which only consists of basic blocks of 3×3 convolutions (He et al., 2016)<br />
<br />
*ResNet-34-sep, a variant where we replace all 3×3 dense convolutions in ResNet-34 with 3×3 separable convolutions (Chollet, 2017)<br />
<br />
*ResNet-50, which consists of bottleneck blocks (He et al., 2016) of 1×1, 3×3, 1×1 convolutions<br />
<br />
*EfficientNet-B0 architecture (Tan & Le, 2019)<br />
<br />
*8-layer CNN with 3×3 convolution<br />
<br />
= Discussions and Conclusions =<br />
<br />
The paper summarizes the result of the experiment among multiple different relational graphs through sampling and analyzing and list six important observations during the experiments, These are:<br />
<br />
* There are always exists graph structure that has higher predictive accuracy under Top-1 error compare to the complete graph<br />
<br />
* There is a sweet spot that the graph structure near the sweet spot usually outperform the base graph<br />
<br />
* The predictive accuracy under top-1 error can be represented by a smooth function of Average Path Length <math> (L) </math> and Clustering Coefficient <math> (C) </math><br />
<br />
* The Experiments is consistent across multiple datasets and multiple graph structure with similar Average Path Length and Clustering Coefficient.<br />
<br />
* The best graph structure can be identified easily.<br />
<br />
* There is a similarity between best artificial neurons and biological neurons.<br />
<br />
----<br />
<br />
<br />
<br />
[[File:Result2_441_2020Group16.png]]<br />
<br />
$$\text{Figure - Results from Experiments}$$<br />
<br />
== Neural networks performance depends on its structure ==<br />
During the experiment, Top-1 errors for all sampled relational graph among multiple tasks and graph structures are recorded. The parameters of the models are average path length and clustering coefficient. Heat maps were created to illustrate the difference in predictive performance among possible average path length and clustering coefficient. In '''Figure - Results from Experiments (a)(c)(f)''', The darker area represents a smaller top-1 error which indicates the model performs better than the light area.<br />
<br />
Compare with the complete graph which has parameter <math> L = 1 </math> and <math> C = 1 </math>, The best performing relational graph can outperform the complete graph baseline by 1.4% top-1 error for MLP on CIFAR-10, and 0.5% to 1.2% for models on ImageNet. Hence it is an indicator that the predictive performance of the neural networks highly depends on the graph structure, or equivalently that the completed graph does not always have the best performance. <br />
<br />
== Sweet spot where performance is significantly improved ==<br />
It had been recognized that training noises often results in inconsistent predictive results. In the paper, the 3942 graphs in the sample had been grouped into 52 bins, each bin had been colored based on the average performance of graphs that fall into the bin. By taking the average, the training noises had been significantly reduced. Based on the heat map '''Figure - Results from Experiments (f)''', the well-performing graphs tend to cluster into a special spot that the paper called “sweet spot” shown in the red rectangle, the rectangle is approximately included clustering coefficient in the range <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math>.<br />
<br />
== Relationship between neural network’s performance and parameters == <br />
When we visualize the heat map, we can see that there is no significant jump of performance that occurred as a small change of clustering coefficient and average path length ('''Figure - Results from Experiments (a)(c)(f)'''). In addition, if one of the variables is fixed in a small range, it is observed that a second-degree polynomial is a good visualization tool for the overall trend ('''Figure - Results from Experiments (b)(d)'''). Therefore, both the clustering coefficient and average path length are highly related to neural network performance by a U-shape. <br />
<br />
== Consistency among many different tasks and datasets ==<br />
They observe that relational graphs with certain graph measures may consistently perform well regardless of how they are instantiated. The paper presents consistency uses two perspectives, one is qualitative consistency and another one is quantitative consistency.<br />
<br />
(1) '''Qualitative Consistency'''<br />
It is observed that the results are consistent from different points of view. Among multiple architecture dataset, it is observed that the clustering coefficient within <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math> consistently outperform the baseline complete graph. <br />
<br />
(2) '''Quantitative Consistency'''<br />
Among different dataset with the network that has similar clustering coefficient and average path length, the results are correlated, The paper mentioned that ResNet-34 is much more complex than 5-layer MLP but a fixed set relational graph would perform similarly in both settings, with Pearson correlation of <math>0.658</math>, the p-value for the Null hypothesis is less than <math>10^{-8}</math>.<br />
<br />
== Top architectures can be identified efficiently ==<br />
The computation cost of finding top architectures can be significantly reduced without training the entire data set for a large value of epoch or a relatively large sample. To achieve the top architectures, the number of graphs and training epochs need to be identified. For the number of graphs, a heatmap is a great tool to demonstrate the result. In the 5-layer MLP on CIFAR-10 example, taking a sample of the data around 52 graphs would have a correlation of 0.9, which indicates that fewer samples are needed for a similar analysis in practice. When determining the number of epochs, correlation can help to show the result. In ResNet34 on ImageNet example, the correlation between the variables is already high enough for future computation within 3 epochs. This means that good relational graphs perform well even at the<br />
initial training epochs.<br />
<br />
== Well-performing neural networks have graph structure surprisingly similar to those of real biological neural networks==<br />
The way we define relational graphs and average length in the graph is similar to the way information is exchanged in network science. The biological neural network also has a similar relational graph representation and graph measure with the best-performing relational graph.<br />
<br />
While there is some organizational similarity between a computational neural network and a biological neural network, we should refrain from saying that both these networks share many similarities or are essentially the same with just different substrates. The biological neurons are still quite poorly understood and it may take a while before their mechanisms are better understood.<br />
<br />
= Critique =<br />
<br />
1. The experiment is only measuring on a single data set which might not be representative enough. As we can see in the whole paper, the "sweet spot" we talked about might be a special feature for the given data set only which is the CIFAR-10 data set. If we change the data set to another imaging data set like CK+, whether we are going to get a similar result is not shown by the paper. Hence, the result that is being concluded from the paper might not be representative enough. <br />
<br />
2. When we are fitting the model in practice, we will fit the model with more than one epoch. The order of the model fitting should be randomized since we should create more random jumps to avoid staked inside a local minimum. With the same order within each epoch, the data might be grouped by different classes or levels, the model might result in a better performance with certain classes and worse performance with other classes. In this particular example, without randomization of the training data, the conclusion might not be precise enough.<br />
<br />
3. This study shows empirical justification for choosing well-performing models from graphs differing only by average path length and clustering coefficient. An equally important question is whether there is the theoretical justification for why these graph properties may (or may not) contribute the performance of a general classifier - for example, is there a combination of these properties that is sufficient to recover the universality theorem for MLP's?<br />
<br />
4. It might be worth looking into how to identify the "sweet spot" for different datasets.<br />
<br />
5. What would be considered a "best graph structure " in the discussion and conclusion part? It seems that the intermediate result of getting an accurate result was by binning graphs into smaller bins, what should we do if the graphs can not be binned into significantly smaller bins in order to proceed with the methodologies mentioned in the paper. Both CIFAR - 10 and ImageNet seem too trivial considering the amount of variation and categories in the dataset. What would the generalizability be to other presentation of images?</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=46455Task Understanding from Confusing Multi-task Data2020-11-25T17:52:36Z<p>T34sun: /* Application of Multi-label Learning */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system, a system that allows the machine to learn from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The first function allocates identifies an input to its respective task and the latter finds the relationship between the input and its label. See figure 1(b). To train a network of CSL, CSL-Net is constructed for representing CSL’s variables. However, this structure cannot be optimized by gradient back-propagation. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization. <br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine sigh complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. In multi-task learning, the task to which every sample belongs is known. With this task definition, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. Without this annotation, our goal is to understand the task concept from confusing input-label pairs. Overall, It requires manual task annotation to learn and this paper is interested in machine learning without a clear task definition and manual task annotation.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one distribution. Thus, latent variable learning is insufficient to solve the research problem, which is multi-task confusing samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, assuming the risk measure is mean squared error (MSE), the expected risk functional is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the prior distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk functional can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk functional. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to recover both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of the MNIST digit data that has been colored. Each observation in this modified set consists of a colored image (<math>x_i</math>) and either the color, or the digit it represents (<math>y_i</math>). The goal is to recover the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': This data set has more observations than the "colored-MNIST" data and consists of pictures labeled with either the “Gender”, “Category”, and “Color” of the clothing item.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': In order to "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well learning the ground-truth. That means the initialization of the neural network is set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labeling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.<br />
<br />
Applications of multi-label classification include building a recommendation system, social media targeting, as well as detecting adverse drug reaction from text.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem with manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, there are some limitations that can be improved for future work:<br />
The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process; The current algorithm is difficult to learn basic features directly through the CNN structure, and it is necessary to learn and train a fully connected network based on CNN features in the experiment.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed for converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented in more examples, to test the robustness of different types of data.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=46454Task Understanding from Confusing Multi-task Data2020-11-25T17:52:18Z<p>T34sun: /* Application of Multi-label Learning */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system, a system that allows the machine to learn from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The first function allocates identifies an input to its respective task and the latter finds the relationship between the input and its label. See figure 1(b). To train a network of CSL, CSL-Net is constructed for representing CSL’s variables. However, this structure cannot be optimized by gradient back-propagation. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization. <br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine sigh complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. In multi-task learning, the task to which every sample belongs is known. With this task definition, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. Without this annotation, our goal is to understand the task concept from confusing input-label pairs. Overall, It requires manual task annotation to learn and this paper is interested in machine learning without a clear task definition and manual task annotation.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one distribution. Thus, latent variable learning is insufficient to solve the research problem, which is multi-task confusing samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, assuming the risk measure is mean squared error (MSE), the expected risk functional is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the prior distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk functional can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk functional. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to recover both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of the MNIST digit data that has been colored. Each observation in this modified set consists of a colored image (<math>x_i</math>) and either the color, or the digit it represents (<math>y_i</math>). The goal is to recover the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': This data set has more observations than the "colored-MNIST" data and consists of pictures labeled with either the “Gender”, “Category”, and “Color” of the clothing item.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': In order to "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well learning the ground-truth. That means the initialization of the neural network is set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labeling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.<br />
Applications of multi-label classification include building a recommendation system, social media targeting, as well as detecting adverse drug reaction from text.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem with manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, there are some limitations that can be improved for future work:<br />
The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process; The current algorithm is difficult to learn basic features directly through the CNN structure, and it is necessary to learn and train a fully connected network based on CNN features in the experiment.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed for converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented in more examples, to test the robustness of different types of data.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=46453Task Understanding from Confusing Multi-task Data2020-11-25T17:51:59Z<p>T34sun: /* Application of Multi-label Learning */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system, a system that allows the machine to learn from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The first function allocates identifies an input to its respective task and the latter finds the relationship between the input and its label. See figure 1(b). To train a network of CSL, CSL-Net is constructed for representing CSL’s variables. However, this structure cannot be optimized by gradient back-propagation. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization. <br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine sigh complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. In multi-task learning, the task to which every sample belongs is known. With this task definition, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. Without this annotation, our goal is to understand the task concept from confusing input-label pairs. Overall, It requires manual task annotation to learn and this paper is interested in machine learning without a clear task definition and manual task annotation.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one distribution. Thus, latent variable learning is insufficient to solve the research problem, which is multi-task confusing samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, assuming the risk measure is mean squared error (MSE), the expected risk functional is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the prior distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk functional can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk functional. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to recover both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of the MNIST digit data that has been colored. Each observation in this modified set consists of a colored image (<math>x_i</math>) and either the color, or the digit it represents (<math>y_i</math>). The goal is to recover the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': This data set has more observations than the "colored-MNIST" data and consists of pictures labeled with either the “Gender”, “Category”, and “Color” of the clothing item.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': In order to "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well learning the ground-truth. That means the initialization of the neural network is set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labeling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels.<br />
Application of multi-label classification includes building a recommendation system, social media targeting, as well as detecting adverse drug reaction from text.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem with manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, there are some limitations that can be improved for future work:<br />
The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process; The current algorithm is difficult to learn basic features directly through the CNN structure, and it is necessary to learn and train a fully connected network based on CNN features in the experiment.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed for converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented in more examples, to test the robustness of different types of data.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=46452Task Understanding from Confusing Multi-task Data2020-11-25T17:51:36Z<p>T34sun: /* Application of Multi-label Learning */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system, a system that allows the machine to learn from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input value. Researchers manually assign tasks to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposed a new learning method called confusing supervised learning (CSL) which includes 2 functions: de-confusing function and mapping function. The first function allocates identifies an input to its respective task and the latter finds the relationship between the input and its label. See figure 1(b). To train a network of CSL, CSL-Net is constructed for representing CSL’s variables. However, this structure cannot be optimized by gradient back-propagation. This difficulty is solved by alternatively performing training for the de-confusing net and mapping net optimization. <br />
<br />
Experiments for function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as the current multi-task machine sigh complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another task. (Caruana, 1997) This results in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. In multi-task learning, the task to which every sample belongs is known. With this task definition, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. Without this annotation, our goal is to understand the task concept from confusing input-label pairs. Overall, It requires manual task annotation to learn and this paper is interested in machine learning without a clear task definition and manual task annotation.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one distribution. Thus, latent variable learning is insufficient to solve the research problem, which is multi-task confusing samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, assuming the risk measure is mean squared error (MSE), the expected risk functional is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the prior distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk functional can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk functional. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net is corresponding to functions set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one certain task. The deconfusing-net is corresponding to function h, whose input is a sample <math>(x,y)</math> and output is an n-dimensional one-hot vector. This output vector determines which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from deconfusing-net. Approximation of softmax will lead the deconfusing-net output into a non-one-hot form, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
<br />
=Experiment=<br />
==Setup==<br />
<br />
3 data sets are used to compare CSL to existing methods, 1 function regression task, and 2 image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple <math>f_j(x),j=1,...,n</math> mapping functions, so the goal is to recover both the mapping functions <math>f_j</math> as well as determine which mapping function corresponds to each of the <math>m</math> observations. 3 scalar-valued, scalar-input functions that intersect at several points with each other have been chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of the MNIST digit data that has been colored. Each observation in this modified set consists of a colored image (<math>x_i</math>) and either the color, or the digit it represents (<math>y_i</math>). The goal is to recover the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': This data set has more observations than the "colored-MNIST" data and consists of pictures labeled with either the “Gender”, “Category”, and “Color” of the clothing item.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': In order to "correctly" partition the observations into the correct tasks, a 5-shot warm-up was used. In this situation, the CSL methods work well learning the ground-truth. That means the initialization of the neural network is set up properly.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labeling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is a <math>n</math>-long indicator vector for whether the image <math>(x_i,y_i)</math> corresponds to each of the <math>n</math> labels<br />
Application of multi-label classification includes building a recommendation system, social media targeting, as well as detecting adverse drug reaction from text.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing the task numbers progressively and testing the performance. Ideally, a better way of deciding the number of tasks is expected rather than increasing it one by one and seeing which is the minimum number of tasks that gives the smallest risk. Adding low-quality constraints to deconfusing-net is a reasonable solution to this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Even though the effectiveness of the proposed algorithm in learning confusing data based on pre-trained features hasn't been affected, the full-connect network can only be trained based on learned CNN features. It is still a challenge for the current algorithm to learn basic features directly through a CNN structure and understand tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem with manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step to moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, there are some limitations that can be improved for future work:<br />
The repeated training process of determining the lowest best task number that has the closest to zero causes inefficiency in the learning process; The current algorithm is difficult to learn basic features directly through the CNN structure, and it is necessary to learn and train a fully connected network based on CNN features in the experiment.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was made with algorithms not designed to deal with confusing data and which do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed for converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented in more examples, to test the robustness of different types of data.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification%E2%80%94%E2%80%94via_Convolution_Neural_Network&diff=46450Semantic Relation Classification——via Convolution Neural Network2020-11-25T17:17:08Z<p>T34sun: /* Critiques */</p>
<hr />
<div><br />
<br />
<br />
== Presented by ==<br />
Rui Gong, Xinqi Ling, Di Ma,Xuetong Wang<br />
<br />
== Introduction ==<br />
One of the emerging trends of natural language technologies is their use for the humanities and sciences (Gbor et al., 2018). SemEval 2018 Task 7 mainly solves the problem of relation extraction and classification of two entities in the same sentence into 6 potential relations. The 6 relations are USAGE, RESULT, MODEL-FEATURE,PART WHOLE, TOPIC, and COMPARE.<br />
<br />
Data comes from 350 scientific paper abstracts, which have 1228 and 1248 annotated sentences for two tasks. For each data, an example sentence was chosen with its right and left sentences, as well as an indicator showing whether the relation is reserved, then a prediction is made. <br />
<br />
Three models were used for the prediction: Linear Classifiers, Long Short-Term Memory(LSTM), and Convolutional Neural Networks (CNN). In the end, the CNN model category performed best, so the article specifically submitted the final submission for this model. By using the learned custom word embedding function, the research team added a variant of negative sampling, thereby improving performance and surpassing ordinary CNN.<br />
<br />
== Previous Work ==<br />
SemEval 2010 Task 8(Hendrickx et al., 2010) explored the classification of natural language relations and studied the 9 relations between word pairs. However, it is not designed for scientific text analysis. Xu et al. (2015a) and Santos et al. (2015) , both of them applied CNN with negative sampling to finish task7. The 2017 SemEval Task 10 also featured relation extraction within scientific publications.<br />
<br />
== Algorithm ==<br />
<br />
[[File:CNN.png|800px]]<br />
<br />
This is the architecture of CNN. We first transform a sentence via Feature embeddings. Basically, we transform each sentence into continuous word embeddings:<br />
<br />
$$<br />
(e^{w_i})<br />
$$<br />
<br />
And word position embeddings:<br />
$$<br />
(e^{wp_i}): e_i = [e^{w_i}, e^{wp_i}]<br />
$$<br />
<br />
In the word embeddings, we got a vocabulary ‘V’, and we will make an embedding word matrix based on the position of the word in the vocabulary. This matrix is trainable and needs to be initialized by pre-trained embedding vectors.<br />
In the word position embeddings, we first need to input some words named ‘entities’ and they are the key for the machine to determine the sentence’s relation. During this process, if we have two entities, we will use the relative position of them in the sentence to make the<br />
embeddings. We will output two vectors and one of them keeps track of the first entity relative position in the sentence ( we will make the entity recorded as 0, the former word recorded as -1 and the next one 1, etc. ). And the same procedure for the second entity. Finally, we will get two vectors concatenated as the position embedding.<br />
<br />
<br />
After the embeddings, the model will transform the embedded sentence to a fix-sized representation of the whole sentence via the convolution layer, finally after the max-pooling to reduce the dimension of the output of the layers, we will get a score for each relation class via a linear transformation.<br />
<br />
<br />
After featurizing all words in the sentence. The sentence of length N can be expressed as a vector of length <math> N </math>, which looks like <br />
$$e=[e_{1},e_{2},\ldots,e_{N}]$$<br />
and each entry represents a token of the word. Also, to apply <br />
convolutional neural network, the subsets of features<br />
$$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$<br />
is given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to <br />
produce a new feature, defiend as <br />
$$c_{i}=\text{tanh}(W\cdot e_{i:i+k-1}+bias)$$<br />
This process is applied to all subsets of features with length <math> k </math> starting <br />
from the first one. Then a mapped feature factor <br />
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$<br />
is produced.<br />
<br />
The max pooling operation is used, the <math> \hat{c}=max\{c\} </math> was picked.<br />
With different weight filter, different mapped feature vectors can be obtained. Finally, the original <br />
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,<br />
then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.<br />
<br />
Then, the score vector <br />
$$s(x)=W^{classes}r_{x}$$<br />
is obtained which represented the score for each class, given <math> x </math>'s entities' relation will be classified as <br />
the one with the highest score. The <math> W^{classes} </math> here is the model being trained.<br />
<br />
To improve the performance, “Negative Sampling" was used. Given the trained data point <br />
<math> \tilde{x} </math>, and its correct class <math> \tilde{y} </math>. Let <math> I=Y\setminus\{\tilde{y}\} </math> represent the <br />
incorrect labels for <math> x </math>. Basically, the distance between the correct score and the positive margin, and the negative <br />
distance (negative margin plus the second largest score) should be minimized. So the loss function is <br />
$$L=\log(1+e^{\gamma(m^{+}-s(x)_{y})})+\log(1+e^{\gamma(m^{-}-\mathtt{max}_{y'\in I}(s(x)_{y'}))})$$<br />
with margins <math> m_{+} </math>, <math> m_{-} </math>, and penalty scale factor <math> \gamma </math>.<br />
The whole training is based on ACL anthology corpus and there are 25,938 papers with 136,772,370 tokens in total, <br />
and 49,600 of them are unique.<br />
<br />
== Results ==<br />
In machine learning, the most important part is to tune the hyper-parameters. Unlike traditional hyper-parameter optimization, there are some<br />
modifications to the model in order to increase performance on the test set. There are 5 modifications that we can apply:<br />
<br />
1. Merged Training Sets. It combined two training sets to increase the data set<br />
size and it improves the equality between classes to get better predictions.<br />
<br />
2. Reversal Indicate Features. It added a binary feature.<br />
<br />
3. Custom ACL Embeddings. It embedded word vector to an ACL-specific<br />
corps.<br />
<br />
4. Context words. Within the sentence, it varies in size on a context window<br />
around the entity-enclosed text.<br />
<br />
5. Ensembling. It used different early stop and random initializations to improve<br />
the predictions.<br />
<br />
These modifications performances well on the training data and they are shown<br />
in table 3.<br />
<br />
[[File:table3.PNG]]<br />
<br />
<br />
<br />
As we can see the best choice for this model is ensembling. Because the random initialization made the data more nature and avoided the overfit.<br />
During the training process, there are some methods such that they can only<br />
increases the score on the cross validation test sets but hurt the performance on<br />
the overall macro-F1 score. Thus, these methods were eventually ruled out.<br />
<br />
<br />
[[File:table4.PNG]]<br />
<br />
There are six submissions in total. Three for each training set and the result<br />
is shown in figure 2.<br />
<br />
The best submission for training set 1.1 is the third submission which did not<br />
use the cross-validation as the test set. Instead, it runs a constant number of<br />
training epochs and based on the training data it can be chosen by cross-validation. The best submission for the training set 1.2 is the first submission which<br />
extracted 10% of the training data as validation accuracy on the test set predictions.<br />
All in all, earlystoping cannot always based on the accuracy of the validation set<br />
since it cannot guarantee to get better performance on the real test set. Thus,<br />
we have to try new approaches and combine them together to see the prediction<br />
results. Also, doing stratification will certainly to improve the performance on<br />
the test data.<br />
<br />
== Conclusions ==<br />
Throughout the process, linear classifiers, sequential random forest, LSTM, and CNN models are tested. Variations are applied to the models. Among all variations, vanilla CNN with negative sampling and ACL-embedding has significant better performance than all others. Attention based pooling, up-sampling and data augmentation are also tested, but they barely perform positive increment on the behavior.<br />
<br />
== Critiques == <br />
<br />
- Applying this in news apps might be beneficial to improve readability by highlighting specific important sections.<br />
<br />
- In the section of previous work, the author mentioned 9 natural language relationship between the word pairs. Among them, 6 potential relationships are USAGE, RESULT, MODEL-FEATURE,PART WHOLE, TOPIC, and COMPARE. It would help the readers to better understand if all 9 relationships are listed in the summary.<br />
<br />
== References ==<br />
Diederik P Kingma and Jimmy Ba. 2014. Adam: A<br />
method for stochastic optimization. arXiv preprint<br />
arXiv:1412.6980.<br />
<br />
DragomirR. Radev, Pradeep Muthukrishnan, Vahed<br />
Qazvinian, and Amjad Abu-Jbara. 2013. The ACL<br />
anthology network corpus. Language Resources<br />
and Evaluation, pages 1–26.<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey<br />
Dean. 2013a. Efficient estimation of word<br />
representations in vector space. arXiv preprint<br />
arXiv:1301.3781.<br />
<br />
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado,<br />
and Jeff Dean. 2013b. Distributed representations<br />
of words and phrases and their compositionality.<br />
In Advances in neural information processing<br />
systems, pages 3111–3119.<br />
<br />
Kata Gbor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Hafa Zargayouna,<br />
and Thierry Charnois. 2018. Semeval-2018 task 7:Semantic relation extraction and classification in scientific papers. <br />
In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval2018), New Orleans, LA, USA, June 2018.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang&diff=46252User:T358wang2020-11-24T21:55:30Z<p>T34sun: </p>
<hr />
<div><br />
== Group ==<br />
Rui Chen, Zeren Shen, Zihao Guo, Taohao Wang<br />
<br />
== Introduction ==<br />
<br />
Landmark recognition is an image retrieval task with its own specific challenges. This paper provides a new and effective method to recognize landmark images, which has been successfully applied to actual images. In this way, statues, buildings, and characteristic objects can be effectively identified.<br />
<br />
There are many difficulties encountered in the development process:<br />
<br />
1. The first problem is that the concept of landmarks cannot be strictly defined, because landmarks can be any object and building.<br />
<br />
2. The second problem is that the same landmark can be photographed from different angles. The results of the multi-angle shooting will result in very different picture characteristics. But the system needs to accurately identify different landmarks. We may also need to consider angles which capture the interior of a building versus the exterior of it, a good model will be able to recognize both.<br />
<br />
3. The third problem is that the landmark recognition system must recognize a large number of landmarks, and the recognition must achieve high accuracy. The challenge here is that there are significantly more objects in the world that are not landmarks.<br />
<br />
These problems require that the system should have a very low false alarm rate and high recognition accuracy. <br />
There are also three potential problems:<br />
<br />
1. The processed data set contains a little error content, the image content is not clean and the quantity is huge.<br />
<br />
2. The algorithm for learning the training set must be fast and scalable.<br />
<br />
3. While displaying high-quality judgment landmarks, there is no image geographic information mixed.<br />
<br />
The article describes the deep convolutional neural network (CNN) architecture, loss function, training method, and inference aspects. Using this model, similar metrics to the state of the art model in the test were obtained and the inference time was found to be 15 times faster. Further, because of the efficient architecture, the system can serve in an online fashion. The results of quantitative experiments will be displayed through testing and deployment effect analysis to prove the effectiveness of the model.<br />
<br />
== Related Work ==<br />
<br />
Landmark recognition can be regarded as one of the tasks of image retrieval, and a large number of documents are concentrated on image retrieval tasks. In the past two decades, the field of image retrieval has made significant progress, and the main methods can be divided into two categories. <br />
The first is a classic retrieval method using local features, a method based on local feature descriptors organized in bag-of-words(A bag of words model is defined as a simplified representation of the text information by retrieve only the significant words in a sentence or paragraph while disregarding its grammar.The bag of words approach is commonly used in classification where the words are used as features in the model-training), spatial verification, Hamming embedding, and query expansion. These methods are dominant in image retrieval. Later, until the rise of deep convolutional neural networks (CNN), deep convolutional neural networks (CNN) were used to generate global descriptors of input images.<br />
Another method is to selectively match the kernel Hamming embedding method extension. With the advent of deep convolutional neural networks, the most effective image retrieval method is based on training CNNs for specific tasks. Deep networks are very powerful for semantic feature representation, which allows us to effectively use them for landmark recognition. This method shows good results but brings additional memory and complexity costs. <br />
The DELF (DEep local feature) by Noh et al. proved promising results. This method combines the classic local feature method with deep learning. This allows us to extract local features from the input image and then use RANSAC for geometric verification. Random Sample Consensus (RANSAC) is a method to smooth data containing a significant percentage of errors, which is ideally suited for applications in automated image analysis where interpretation is based on the data generated by error-prone feature detectors. The goal of the project is to describe a method for accurate and fast large-scale landmark recognition using the advantages of deep convolutional neural networks.<br />
<br />
== Methodology ==<br />
<br />
This section will describe in detail the CNN architecture, loss function, training procedure, and inference implementation of the landmark recognition system. The figure below is an overview of the landmark recognition system.<br />
<br />
[[File:t358wang_landmark_recog_system.png |center|800px]]<br />
<br />
The landmark CNN consists of three parts including the main network, the embedding layer, and the classification layer. To obtain a CNN main network suitable for training landmark recognition model, fine-tuning is applied and several pre-trained backbones (Residual Networks) based on other similar datasets, including ResNet-50, ResNet-200, SE-ResNext-101, and Wide Residual Network (WRN-50-2), are evaluated based on inference quality and efficiency. Based on the evaluation results, WRN-50-2 is selected as the optimal backbone architecture. Fine-tuning is a very efficient technique in various computer vision applications, because we can take advantage of everything the model has already learned and apply it to our specific task.<br />
<br />
[[File:t358wang_backbones.png |center|600px]]<br />
<br />
For the embedding layer, as shown in the below figure, the last fully-connected layer after the averaging pool is removed. Instead, a fully-connected 2048 <math>\times</math> 512 layer and a batch normalization are added as the embedding layer. After the batch norm, a fully-connected 512 <math>\times</math> n layer is added as the classification layer. The below figure shows the overview of the CNN architecture of the landmark recognition system.<br />
<br />
[[File:t358wang_network_arch.png |center|800px]]<br />
<br />
To effectively determine the embedding vectors for each landmark class (centroids), the network needs to be trained to have the members of each class to be as close as possible to the centroids. Several suitable loss functions are evaluated including Contrastive Loss, Arcface, and Center loss. The center loss is selected since it achieves the optimal test results and it trains a center of embeddings of each class and penalizes distances between image embeddings as well as their class centers. In addition, the center loss is a simple addition to softmax loss and is trivial to implement.<br />
<br />
When implementing the loss function, a new additional class that includes all non-landmark instances needs to be added and the center loss function needs to be modified as follows: Let n be the number of landmark classes, m be the mini-batch size, <math>x_i \in R^d</math> is the i-th embedding and <math>y_i</math> is the corresponding label where <math>y_i \in</math> {1,...,n,n+1}, n+1 is the label of the non-landmark class. Denote <math>W \in R^{d \times n}</math> as the weights of the classifier layer, <math>W_j</math> as its j-th column. Let <math>c_{y_i}</math> be the <math>y_i</math> th embeddings center from Center loss and <math>\lambda</math> be the balancing parameter of Center loss. Then the final loss function will be: <br />
<br />
[[File:t358wang_loss_function.png |center|600px]]<br />
<br />
In the training procedure, the stochastic gradient descent(SGD) will be used as the optimizer with momentum=0.9 and weight decay = 5e-3. For the center loss function, the parameter <math>\lambda</math> is set to 5e-5. Each image is resized to 256 <math>\times</math> 256 and several data augmentations are applied to the dataset including random resized crop, color jitter and random flip. The training dataset is divided into four parts based on the geographical affiliation of cities where landmarks are located: Europe/Russia, North America/Australia/Oceania, Middle East/North Africa, and the Far East. <br />
<br />
The paper introduces curriculum learning for landmark recognition, which is shown in the below figure. The algorithm is trained for 30 epochs and the learning rate <math>\alpha_1, \alpha_2, \alpha_3</math> will be reduced by a factor of 10 at the 12th epoch and 24th epoch.<br />
<br />
[[File:t358wang_algorithm1.png |center|600px]]<br />
<br />
In the inference phase, the paper introduces the term “centroids” which are embedding vectors that are calculated by averaging embeddings and are used to describe landmark classes. The calculation of centroids is significant to effectively determine whether a query image contains a landmark. The paper proposes two approaches to help the inference algorithm to calculate the centroids. First, instead of using the entire training data for each landmark, data cleaning is done to remove most of the redundant and irrelevant elements in the image. Second, since each landmark can have different shooting angles, it is more efficient to calculate a separate centroid for each shooting angle. Hence, a hierarchical agglomerative clustering algorithm is proposed to partition training data into several valid clusters for each landmark and the set of centroids for a landmark L can be represented by <math>\mu_{l_j} = \frac{1}{v} \sum_{i \in C_j} x_i, j \in 1,...,v</math> where v is the number of valid clusters for landmark L. <br />
<br />
Once the centroids are calculated for each landmark class, the system can make decisions whether there is any landmark in an image. The query image is passed through the landmark CNN and the resulting embedding vector is compared with all centroids by dot product similarity using approximate k-nearest neighbors (AKNN). To distinguish landmark classes from non-landmark, a threshold <math>\eta</math> is set and it will be compared with the maximum similarity to determine if the image contains any landmarks.<br />
<br />
The full inference algorithm is described in the below figure.<br />
<br />
[[File:t358wang_algorithm2.png |center|600px]]<br />
<br />
== Experiments and Analysis ==<br />
<br />
'''Offline test'''<br />
<br />
In order to measure the quality of the model, an offline test set was collected and manually labeled. According to the calculations, photos containing landmarks make up 1 − 3% of the total number of photos on average. This distribution was emulated in an offline test, and the geo-information and landmark references weren’t used. <br />
The results of this test are presented in the table below. Two metrics were used to measure the results of experiments: Sensitivity — the accuracy of a model on images with landmarks (also called Recall) and Specificity — the accuracy of a model on images without landmarks. Several types of DELF were evaluated, and the best results in terms of sensitivity and specificity were included in the table below. The table also contains the results of the model trained only with Softmax loss, Softmax, and Center loss. Thus, the table below reflects improvements in our approach with the addition of new elements in it.<br />
<br />
[[File:t358wang_models_eval.png |center|600px]]<br />
<br />
It’s very important to understand how a model works on “rare” landmarks due to the small amount of data for them. Therefore, the behavior of the model was examined separately on “rare” and “frequent” landmarks in the table below. The column “Part from total number” shows what percentage of landmark examples in the offline test has the corresponding type of landmarks. And we find that the sensitivity of “frequent” landmarks is much higher than “rare” landmarks.<br />
<br />
[[File:t358wang_rare_freq.png |center|600px]]<br />
<br />
Analysis of the behavior of the model in different categories of landmarks in the offline test is presented in the table below. These results show that the model can successfully work with various categories of landmarks. Predictably better results (92% of sensitivity and 99.5% of specificity) could also be obtained when the offline test with geo-information was launched on the model.<br />
<br />
[[File:t358wang_landmark_category.png |center|600px]]<br />
<br />
'''Revisited Paris dataset'''<br />
<br />
Revisited Paris dataset (RPar)[2] was also used to measure the quality of the landmark recognition approach. This dataset with Revisited Oxford (ROxf) is standard benchmarks for the comparison of image retrieval algorithms. In recognition, it is important to determine the landmark, which is contained in the query image. Images of the same landmark can have different shooting angles or taken inside/outside the building. Thus, it is reasonable to measure the quality of the model in the standard and adapt it to the task settings. That means not all classes from queries are presented in the landmark dataset. For those images containing correct landmarks but taken from different shooting angles within the building, we transferred them to the “junk” category, which does not influence the final score and makes the test markup closer to our model’s goal. Results on RPar with and without distractors in medium and hard modes are presented in the table below. <br />
<br />
[[File:t358wang_methods_eval1.png |center|600px]]<br />
<br />
[[File:t358wang_methods_eval2.png |center|600px]]<br />
<br />
== Comparison ==<br />
<br />
Recent most efficient approaches to landmark recognition are built on fine-tuned CNN. We chose to compare our method to DELF on how well each performs on recognition tasks. A brief summary is given below:<br />
<br />
[[File:t358wang_comparison.png |center|600px]]<br />
<br />
''' Offline test and timing '''<br />
<br />
Both approaches obtained similar results for image retrieval in the offline test (shown in the sensitivity&specificity table), but the proposed approach is much faster on the inference stage and more memory efficient.<br />
<br />
To be more detailed, during the inference stage, DELF needs more forward passes through CNN, has to search the entire database, and performs the RANSAC method for geometric verification. All of them make it much more time-consuming than our proposed approach. Our approach mainly uses centroids, this makes it take less time and needs to store fewer elements.<br />
<br />
== Conclusion ==<br />
<br />
In this paper we were hoping to solve some difficulties that emerge when trying to apply landmark recognition to the production level: there might not be a clean & sufficiently large database for interesting tasks, algorithms should be fast, scalable, and should aim for low FP and high accuracy.<br />
<br />
While aiming for these goals, we presented a way of cleaning landmark data. And most importantly, we introduced the usage of embeddings of deep CNN to make recognition fast and scalable, trained by curriculum learning techniques with modified versions of Center loss. Compared to the state-of-the-art methods, this approach shows similar results but is much faster and suitable for implementation on a large scale.<br />
<br />
== Critique ==<br />
The paper selected 5 images per landmark and checked them manually. That means the training process takes a long time on data cleaning and so the proposed algorithm lacks reusability. Also, since only the landmarks that are the largest and most popular were used to train the CNN, the trained model will probably be most useful in big cities instead of smaller cities with less popular landmarks.<br />
<br />
== References ==<br />
[1] Andrei Boiarov and Eduard Tyantov. 2019. Large Scale Landmark Recognition via Deep Metric Learning. In The 28th ACM International Conference on Information and Knowledge Management (CIKM ’19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 10 pages. https://arxiv.org/pdf/1908.10192.pdf 3357384.3357956<br />
<br />
[2] FilipRadenović,AhmetIscen,GiorgosTolias,YannisAvrithis,andOndřejChum.<br />
2018. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking.<br />
arXiv preprint arXiv:1803.11285 (2018).</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat&diff=46250User:Cvmustat2020-11-24T21:43:38Z<p>T34sun: </p>
<hr />
<div><br />
== Combine Convolution with Recurrent Networks for Text Classification == <br />
'''Team Members''': Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea<br />
<br />
'''Date''': Week of Nov 23 <br />
<br />
== Introduction ==<br />
<br />
<br />
Text classification is the task of assigning a set of predefined categories to natural language texts. It is a fundamental task in Natural Language Processing (NLP) with various applications such as sentiment analysis, and topic classification. A classic example involving text classification is given a set of News articles, is it possible to classify the genre or subject of each article? Text classification is useful as text data is a rich source of information, but extracting insights from it directly can be difficult and time-consuming as most text data is unstructured.[1] NLP text classification can help automatically structure and analyze text, quickly and cost-effectively, allowing for individuals to extract important features from the text easier than before. <br />
<br />
Text classification work mainly focuses on three topics: feature engineering, feature selection, and the use of different types of machine learning algorithms.<br />
1. Feature engineering, the most widely used feature is the bag of words feature. Some more complex functions are also designed, such as part-of-speech tags, noun phrases and tree kernels.<br />
2. Feature selection aims to remove noisy features and improve classification performance. The most common feature selection method is to delete stop words.<br />
3. Machine learning algorithms usually use classifiers, such as Logistic Regression (LR), Naive Bayes (NB) and Support Vector Machine (SVM).<br />
<br />
In practice, pre-trained word embeddings and deep neural networks are used together for NLP text classification. Word embeddings are used to map the raw text data to an implicit space where the semantic relationships of the words are preserved; words with similar meaning have a similar representation. One can then feed these embeddings into deep neural networks to learn different features of the text. Convolutional neural networks can be used to determine the semantic composition of the text(the meaning), as it treats texts as a 2D matrix by concatenating the embedding of words together, and it is able to capture both local and position invariant features of the text.[2] Alternatively, Recurrent Neural Networks can be used to determine the contextual meaning of each word in the text (how each word relates to one another) by treating the text as sequential data and then analyzing each word separately. [3] Previous approaches to attempt to combine these two neural networks to incorporate the advantages of both models involve streamlining the two networks, which might decrease their performance. In addition, most methods incorporating a bi-directional Recurrent Neural Network usually concatenate the forward and backward hidden states at each time step, which results in a vector that does not have the interaction information between the forward and backward hidden states.[4] The hidden state in one direction contains only the contextual meaning in that particular direction, however a word's contextual representation, intuitively, is more accurate when collected and viewed from both directions. This paper argues that the failure to observe the meaning of a word in both directions causes the loss of the true meaning of the word, especially for polysemic words (words with more than one meaning) that are context-sensitive.<br />
<br />
== Paper Key Contributions ==<br />
<br />
This paper suggests an enhanced method of text classification by proposing a new way of combining Convolutional and Recurrent Neural Networks involving the addition of a neural tensor layer. The proposed method maintains each network's respective strengths that are normally lost in previous combination methods. The new suggested architecture is called CRNN, which utilizes both a CNN and RNN that run in parallel on the same input sentence. The CNN uses weight matrix learning and produces a 2D matrix that shows the importance of each word based on local and position-invariant features. The bidirectional RNN produces a matrix that learns each word's contextual representation; the words' importance in relation to the rest of the sentence. A neural tensor layer is introduced on top of the RNN to obtain the fusion of bi-directional contextual information surrounding a particular word. This method combines these two matrix representations and classifies the text, providing the importance information of each word for prediction, which helps to explain the results. The model also uses dropout and L2 regularization to prevent overfitting.<br />
<br />
== CRNN Results vs Benchmarks ==<br />
<br />
In order to benchmark the performance of the CRNN model, as well as compare it to other previous efforts, multiple datasets and classification problems were used. All of these datasets are publicly available and can be easily downloaded by any user for testing.<br />
<br />
'''Movie Reviews:''' a sentiment analysis dataset, with two classes (positive and negative).<br />
<br />
'''Yelp:''' a sentiment analysis dataset, with five classes. For this test, a subset of 120,000 reviews was randomly chosen from each class for a total of 600,000 reviews.<br />
<br />
'''AG's News:''' a news categorization dataset, using only the 4 largest classes from the dataset.<br />
<br />
'''20 Newsgroups:''' a news categorization dataset, again using only 4 large classes from the dataset.<br />
<br />
'''Sogou News:''' a Chinese news categorization dataset, using the 4 largest classes from the dataset.<br />
<br />
'''Yahoo! Answers:''' a topic classification dataset, with 10 classes.<br />
<br />
For the English language datasets, the initial word representations were created using the publicly available ''word2vec'' [https://code.google.com/p/word2vec/] from Google news. For the Chinese language dataset, ''jieba'' [https://github.com/fxsjy/jieba] was used to segment sentences, and then 50-dimensional word vectors were trained on Chinese ''wikipedia'' using ''word2vec''.<br />
<br />
A number of other models are run against the same data after preprocessing, to obtain the following results:<br />
<br />
[[File:table of results.png|550px]]<br />
<br />
The bold results represent the best performing model for a given dataset. These results show that the CRNN model manages to be the best performing in 4 of the 6 datasets, with the Self-attentive LSTM beating the CRNN by 0.03 and 0.12 on the news categorization problems. Considering that the CRNN model has better performance than the Self-attentive LSTM on the other 4 datasets, this suggests that the CRNN model is a better performer overall in the conditions of this benchmark.<br />
<br />
Another important result was that the CRNN model filter size impacted performance only in the sentiment analysis datasets, as seen in the following:<br />
<br />
[[File:filter_effects.png|550px]]<br />
<br />
== CRNN Model Architecture ==<br />
<br />
The CRNN model is a combination of RNN and CNN. It uses CNN to compute the importance of each word in the text and utilizes a neural tensor layer to fuse forward and backward hidden states of bi-directional RNN.<br />
<br />
The input of the network is a text, which is a sequence of words. The output of the network is the text representation that is subsequently used as input of a fully-connected layer to obtain the class prediction.<br />
<br />
'''RNN Pipeline:'''<br />
<br />
The goal of the RNN pipeline is to input each word in a text, and retrieve the contextual information surrounding the word and compute the contextual representation of the word itself. This is accomplished by use of a bi-directional RNN, such that a Neural Tensor Layer (NTL) can combine the results of the RNN to obtain the final output. RNNs are well-suited to NLP tasks because of their ability to sequentially process data such as ordered text.<br />
<br />
A RNN is similar to a feed-forward neural network, but it relies on the use of hidden states. Hidden states are layers in the neural net that produce two outputs: <math> \hat{y}_{t} </math> and <math> h_t </math>. For a time step <math> t </math>, <math> h_t </math> is fed back into the layer to compute <math> \hat{y}_{t+1} </math> and <math> h_{t+1} </math>. <br />
<br />
The pipeline will actually use a variant of RNN called GRU, short for Gated Recurrent Units. This is done to address the vanishing gradient problem which causes the network to struggle memorizing words that came earlier in the sequence. Traditional RNNs are only able to remember the most recent words in a sequence, which may be problematic since words that came at the beginning of the sequence that are important to the classification problem may be forgotten. A GRU attempts to solve this by controlling the flow of information through the network using update and reset gates. <br />
<br />
Let <math>h_{t-1} \in \mathbb{R}^m, x_t \in \mathbb{R}^d </math> be the inputs, and let <math>\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h \in \mathbb{R}^{m \times d}, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h \in \mathbb{R}^{m \times m}</math> be trainable weight matrices. Then the following equations describe the update and reset gates:<br />
<br />
<math><br />
z_t = \sigma(\mathbf{W}_zx_t + \mathbf{U}_zh_{t-1}) \text{update gate} \\<br />
r_t = \sigma(\mathbf{W}_rx_t + \mathbf{U}_rh_{t-1}) \text{reset gate} \\<br />
\tilde{h}_t = \text{tanh}(\mathbf{W}_hx_t + r_t \circ \mathbf{U}_hh_{t-1}) \text{new memory} \\<br />
h_t = (1-z_t)\circ \tilde{h}_t + z_t\circ h_{t-1}<br />
</math><br />
<br />
Note that <math> \sigma, \text{tanh}, \circ </math> are all element-wise functions. The above equations do the following:<br />
<br />
<ol><br />
<li> <math>h_{t-1}</math> carries information from the previous iteration and <math>x_t</math> is the current input </li><br />
<li> the update gate <math>z_t</math> controls how much past information should be forwarded to the next hidden state </li><br />
<li> the rest gate <math>r_t</math> controls how much past information is forgotten or reset </li><br />
<li> new memory <math>\tilde{h}_t</math> contains the relevant past memory as instructed by <math>r_t</math> and current information from the input <math>x_t</math> </li><br />
<li> then <math>z_t</math> is used to control what is passed on from <math>h_{t-1}</math> and <math>(1-z_t)</math> controls the new memory that is passed on<br />
</ol><br />
<br />
We treat <math>h_0</math> and <math> h_{n+1} </math> as zero vectors in the method. Thus, each <math>h_t</math> can be computed as above to yield results for the bi-directional RNN. The resulting hidden states <math>\overrightarrow{h_t}</math> and <math>\overleftarrow{h_t}</math> contain contextual information around the <math> t</math>-th word in forward and backward directions respectively. Contrary to convention, instead of concatenating these two vectors, it is argued that the word's contextual representation is more precise when the context information from different directions is collected and fused using a neural tensor layer as it permits greater interactions among each element of hidden states. Using these two vectors as input to the neural tensor layer, <math>V^i </math>, we compute a new representation that aggregates meanings from the forward and backward hidden states more accurately as follows:<br />
<br />
<math> <br />
[\hat{h_t}]_i = tanh(\overrightarrow{h_t}V^i\overleftarrow{h_t} + b_i) <br />
</math><br />
<br />
Where <math>V^i \in \mathbb{R}^{m \times m} </math> is the learned tensor layer, and <math> b_i \in \mathbb{R} </math> is the bias.We repeat this <math> m </math> times with different <math>V^i </math> matrices and <math> b_i </math> vectors. Through the neural tensor layer, each element in <math> [\hat{h_t}]_i </math> can be viewed as a different type of intersection between the forward and backward hidden states. In the model, <math> [\hat{h_t}]_i </math> will have the same size as the forward and backward hidden states. We then concatenate the three hidden states vectors to form a new vector that summarizes the context information :<br />
<math><br />
\overleftrightarrow{h_t} = [\overrightarrow{h_t}^T,\overleftarrow{h_t}^T,\hat{h_t}]^T <br />
</math><br />
<br />
We calculate this vector for every word in the text and then stack them all into matrix <math> H </math> with shape <math>n</math>-by-<math>3m</math>.<br />
<br />
<math><br />
H = [\overleftrightarrow{h_1};...\overleftrightarrow{h_n}]<br />
</math><br />
<br />
This <math>H</math> matrix is then forwarded as the results from the Recurrent Neural Network.<br />
<br />
<br />
'''CNN Pipeline:'''<br />
<br />
The goal of the CNN pipeline is to learn the relative importance of words in an input sequence based on different aspects. The process of this CNN pipeline is summarized as the following steps:<br />
<br />
<ol><br />
<li> Given a sequence of words, each word is converted into a word vector using the word2vec algorithm which gives matrix X. <br />
</li><br />
<br />
<li> Word vectors are then convolved through the temporal dimension with filters of various sizes (ie. different K) with learnable weights to capture various numerical K-gram representations. These K-gram representations are stored in matrix C.<br />
</li><br />
<br />
<ul><br />
<li> The convolution makes this process capture local and position-invariant features. Local means the K words are contiguous. Position-invariant means K contiguous words at any position are detected in this case via convolution.<br />
<br />
<li> Temporal dimension example: convolve words from 1 to K, then convolve words 2 to K+1, etc<br />
</li><br />
</ul><br />
<br />
<li> Since not all K-gram representations are equally meaningful, there is a learnable matrix W which takes the linear combination of K-gram representations to more heavily weigh the more important K-gram representations for the classification task.<br />
</li><br />
<br />
<li> Each linear combination of the K-gram representations gives the relative word importance based on the aspect that the linear combination encodes.<br />
</li><br />
<br />
<li> The relative word importance vs aspect gives rise to an interpretable attention matrix A, where each element says the relative importance of a specific word for a specific aspect.<br />
</li><br />
<br />
</ol><br />
<br />
[[File:Group12_Figure1.png |center]]<br />
<br />
<div align="center">Figure 1: The architecture of CRNN.</div><br />
<br />
== Merging RNN & CNN Pipeline Outputs ==<br />
<br />
The results from both the RNN and CNN pipeline can be merged by simply multiplying the output matrices. That is, we compute <math>S=A^TH</math> which has shape <math>z \times 3m</math> and is essentially a linear combination of the hidden states. The concatenated rows of S results in a vector in <math>\mathbb{R}^{3zm}</math>, and can be passed to a fully connected Softmax layer to output a vector of probabilities for our classification task. <br />
<br />
To train the model, we make the following decisions:<br />
<ul><br />
<li> Use cross-entropy loss as the loss function (A cross-entropy loss function usually takes in two distributions, a true distribution p and an estimated distribution q, and measures the average number of bits need to identify an event. This calculation is independent of the kind of layers used in the network as well as the kind of activation being implemented.) </li><br />
<li> Perform dropout on random columns in matrix C in the CNN pipeline </li><br />
<li> Perform L2 regularization on all parameters </li><br />
<li> Use stochastic gradient descent with a learning rate of 0.001 </li><br />
</ul><br />
<br />
== Interpreting Learned CRNN Weights ==<br />
<br />
Recall that attention matrix A essentially stores the relative importance of every word in the input sequence for every aspect chosen. Naturally, this means that A is an n-by-z matrix, with n being the number of words in the input sequence and z being the number of aspects considered in the classification task. <br />
<br />
Furthermore, for any specific aspect, words with higher attention values are more important relative to other words in the same input sequence. likewise, for any specific word, aspects with higher attention values prioritize the specific word more than other aspects.<br />
<br />
For example, in this paper, a sentence is sampled from the Movie Reviews dataset and the transpose of attention matrix A is visualized. Each word represents an element in matrix A, the intensity of red represents the magnitude of an attention value in A, and each sentence is the relative importance of each word for a specific context. In the first row, the words are weighted in terms of a positive aspect, in the last row, the words are weighted in terms of a negative aspect, and in the middle row, the words are weighted in terms of a positive and negative aspect. Notice how the relative importance of words is a function of the aspect.<br />
<br />
[[File:Interpretation example.png|800px]]<br />
<br />
== Conclusion & Summary ==<br />
<br />
This paper proposed a new architecture, the Convolutional Recurrent Neural Network, for text classification. The Convolutional Neural Network is used to learn the relative importance of each word from different aspects and stores it into a weight matrix. The Recurrent Neural Network learns each word's contextual representation through the combination of the forward and backward context information that is fused using a neural tensor layer and is stored as a matrix. These two matrices are then combined to get the text representation used for classification. Although the specifics of the performed tests are lacking, the experiment's results indicate that the proposed method performed well in comparison to most previous methods. In addition to performing well, the proposed method also provides insight into which words contribute greatly to the classification decision as the learned matrix from the Convolutional Neural Network stores the relative importance of each word. This information can then be used in other applications or analysis. In the future, one can explore the features extracted from the model and use them to potentially learn new methods such as model space. [5]<br />
<br />
== Critiques ==<br />
<br />
In the '''Method''' section of the paper, some explanations used the same notation for multiple different elements of the model. This made the paper harder to follow and understand since they were referring to different elements by identical notation.<br />
<br />
In '''Comparison of Methods''', the authors discuss the range of hyperparameter settings that they search through. While some of the hyperparameters have a large range of search values, three parameters are fixed without much explanation as to why for all experiments, size of the hidden state of GRU, number of layers, and dropout. These parameters have a lot to do with the complexity of the model and this paper could be improved by providing relevant reasoning behind these values, or by providing additional experimental results over different values of these parameters.<br />
<br />
In the '''Results''' section of the paper, they tried to show that the classification results from the CRNN model can be better interpreted than other models. In these explanations, the details were lacking and the authors did not adequately demonstrate how their model is better than others.<br />
<br />
Finally, in the '''Results''' section again, the paper compares the CRNN model to several models which they did not implement and reproduce results with. This can be seen in the chart of results above, where several models do not have entries in the table for all six datasets. Since the authors used a subset of the datasets, these other models which were not reproduced could have different accuracy scores if they had been tested on the same data as the CRNN model. This difference in training and testing data is not mentioned in the paper, and the conclusion that the CRNN model is better in all cases may not be valid.<br />
<br />
- Could this be applied to hieroglyphs to decipher/better understand them?<br />
<br />
== References ==<br />
----<br />
<br />
[1] Grimes, Seth. “Unstructured Data and the 80 Percent Rule.” Breakthrough Analysis, 1 Aug. 2008, breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.<br />
<br />
[2] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modelling sentences,”<br />
arXiv preprint arXiv:1404.2188, 2014.<br />
<br />
[3] K. Cho, B. V. Merri¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning<br />
phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint<br />
arXiv:1406.1078, 2014.<br />
<br />
[4] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Proceedings<br />
of AAAI, 2015, pp. 2267–2273.<br />
<br />
[5] H. Chen, P. Tio, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE<br />
Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 124–136, 2014.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou&diff=46249User:J46hou2020-11-24T21:27:20Z<p>T34sun: </p>
<hr />
<div>DROCC: Deep Robust One-Class Classification<br />
== Presented by == <br />
Jinjiang Lian, Yisheng Zhu, Jiawen Hou, Mingzhe Huang<br />
== Introduction ==<br />
In this work, we study “one-class” classification, where the goal is to obtain accurate discriminators for a special class. Popular uses of this technique include anomaly detection where we are interested in detecting outliers. Anomaly detection is a well-studied area of research, however, the conventional approach of modeling with typical data using a simple function falls short when it comes to complex domains such as vision or speech. Another case where this would be useful is when recognizing “wake-word” while waking up AI systems such as Alexa. In this work, we are presenting a new approach called Deep Robust One-Class Classification (DROCC). DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear low dimensional manifold. More specifically, we are presenting DROCC-LF which is an outlier-exposure style extension of DROCC. This extension combines the DROCC's anomaly detection loss with standard classification loss over the negative data.<br />
<br />
== Previous Work ==<br />
The current state of art methodology to tackle this kind of problems are: <br />
<br />
1. Approach based on prediction transformations (Golan & El-Yaniv, 2018; Hendrycks et al.,2019a) [1] This approach has some shortcomings in the sense that it depends heavily on an appropriate domain-specific set of transformations that are in general hard to obtain. <br />
<br />
2. Approach of minimizing a classical one-class loss on the learned final layer representations such as DeepSVDD. (Ruff et al.,2018)[2] This method suffers from the fundamental drawback of representation collapse where the model is no longer being able to accurately recognize the feature representations.<br />
<br />
3. Approach based on balancing unbalanced training data sets using methods such as SMOTE to synthetically create outlier data to train models on.<br />
<br />
== Motivation ==<br />
Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009) [3]. Classical approaches for anomaly detection are based on modeling the typical data using simple functions over the low-dimensional subspace or a tree-structured partition of the input space to detect anomalies (Sch¨olkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004) [4], such as constructing a minimum-enclosing ball around the typical data points (Tax & Duin, 2004) [5]. They broadly fall into three categories: AD via generative modeling, Deep Once Class SVM, Transformations based methods. While these techniques are well-suited when the input is featured appropriately, they struggle on complex domains like vision and speech, where hand-designing features is difficult.<br />
DROCC is robust to representation collapse by involving a discriminative component that is general and empirically accurate on most standard domains like tabular, time-series and vision without requiring any additional side information. DROCC is motivated by the key observation that generally, the typical data lies on a low-dimensional manifold, which is well-sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). [6]<br />
<br />
== Model Explanation ==<br />
[[File:drocc_f1.jpg | center]]<br />
<div align="center">Figure 1</div><br />
(a): A normal data manifold with red dots representing generated anomalous points in Ni(r). <br />
<br />
(b): Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. <br />
<br />
(c), (d): First two dimensions of the decision boundary of DROCC and DROCC–LF, when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal while DROCC’s decision boundary is inaccurate. Yellow color sine wave depicts the train data.<br />
<br />
== DROCC ==<br />
The model is based on the assumption that the true data lines on a manifold. As manifolds resemble Euclidean space locally, our discriminative component is based on classifying a point as anomalous if it is outside the union of small L2 norm balls around the training typical points (See Figure 1a, 1b for an illustration). Importantly, the above definition allows us to synthetically generate anomalous points, and we adaptively generate the most effective anomalous points while training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase to adaptively add anomalous points to our training set and a gradient descent phase to minimize the classification loss by learning a representation and a classifier on top of the representations to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to a representation collapse as mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.<br />
== DROCC-LF ==<br />
To especially tackle problems such as anomaly detection and outlier exposure (Hendrycks et al., 2019a) [7] We propose DROCC–LF, an outlier-exposure style extension of DROCC. Intuitively, DROCC–LF combines DROCC’s anomaly detection loss (that is over only the positive data points) with standard classification loss over the negative data. But, in addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance to compare points over the manifold instead of using the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively fewer samples. (See Figure 1c, 1d for illustration)<br />
<br />
== Popular Dataset Benchmark Result ==<br />
<br />
[[File:drocc_auc.jpg | center]]<br />
<div align="center">Figure 2: AUC result</div><br />
<br />
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. The average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10 is shown in table 1. DROCC outperforms baselines on most classes, with gains as high at 20%, and notably, nearest neighbors (NN) beats all the baselines on 2 classes.<br />
<br />
[[File:drocc_f1score.jpg | center]]<br />
<div align="center">Figure 3: F1-Score</div><br />
<br />
Table 2 shows F1-Score (with standard deviation) for one-vs-all anomaly detection on Thyroid, Arrhythmia, and Abalone datasets from UCI Machine Learning Repository. DROCC outperforms the baselines on all the three datasets by a minimum of 0.07 which is about 11.5% performance increase.<br />
Results on One-class Classification with Limited Negatives (OCLN): <br />
[[File:ocln.jpg | center]]<br />
<div align="center">Figure 4: Sample postives, negatives and close negatives for MNIST digit 0 vs 1 experiment (OCLN).</div><br />
MNIST 0 vs. 1 Classification: <br />
We consider an experimental setup on MNIST dataset, where the training data consists of the Digit 0, the normal class, and the Digit 1 as the anomaly. During evaluation, in addition to samples from training distribution, we also have half zeros, which act as challenging OOD points (close negatives). These half zeros are generated by randomly masking 50% of the pixels (Figure 2). BCE performs poorly, with a recall of 54% only at a fixed FPR of 3%. DROCC–OE gives a recall value of 98:16% outperforming DeepSAD by a margin of 7%, which gives a recall value of 90:91%. DROCC–LF provides further improvement with a recall of 99:4% at 3% FPR. <br />
<br />
[[File:ocln_2.jpg | center]]<br />
<div align="center">Figure 5: OCLN on Audio Commands.</div><br />
Wake word Detection: <br />
Finally, we evaluate DROCC–LF on the practical problem of wake word detection with low FPR against arbitrary OOD negatives. To this end, we identify a keyword, say “Marvin” from the audio commands dataset (Warden, 2018) [8] as the positive class, and the remaining 34 keywords are labeled as the negative class. For training, we sample points uniformly at random from the above-mentioned dataset. However, for evaluation, we sample positives from the train distribution, but negatives contain a few challenging OOD points as well. Sampling challenging negatives itself is a hard task and is the key motivating reason for studying the problem. So, we manually list close-by keywords to Marvin such as: Mar, Vin, Marvelous etc. We then generate audio snippets for these keywords via a speech synthesis tool 2 with a variety of accents.<br />
Figure 3 shows that for 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines. For example, with FPR=3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword: Seven, and observed a similar trend. In summary, DROCC–LF is able to generalize well against negatives that are “close” to the true positives even when such negatives were not supplied with the training data.<br />
<br />
== Conclusion and Future Work ==<br />
We introduced DROCC method for deep anomaly detection. It models normal data points using a low-dimensional manifold, and hence can compare close point via Euclidean distance. Based on this intuition, DROCC’s optimization is formulated as a saddle point problem which is solved via standard gradient descent-ascent algorithm. We then extended DROCC to OCLN problem where the goal is to generalize well against arbitrary negatives, assuming positive class is well sampled and a small number of negative points are also available. Both the methods perform significantly better than strong baselines, in their respective problem settings. <br />
<br />
For computational efficiency, we simplified the projection set for both the methods which can perhaps slow down the convergence of the two methods. Designing optimization algorithms that can work with the stricter set is an exciting research direction. Further, we would also like to rigorously analyze DROCC, assuming enough samples from a low-curvature manifold. Finally, as OCLN is an exciting problem that routinely comes up in a variety of real-world applications, we would like to apply DROCC–LF to a few high impact scenarios.<br />
<br />
The results of this study showed that DROCC is comparatively better for anomaly detection across many different areas, such as tabular data, images, audio, and time series, when compared to existing state-of-the-art techniques.<br />
<br />
It would be interesting to see how the DROCC method performs in situations where the anomaly is very rare, say detecting signals of volacanic explosion from seismic activity data. Such challening anomalous situations will be a test of endurance for this method and can even help advance work in this area.<br />
<br />
== References ==<br />
[1]: Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.<br />
<br />
[2]: Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., M¨uller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
[3]: Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.<br />
<br />
[4]: Sch¨olkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.<br />
<br />
[5]: Tax, D. M. and Duin, R. P. Support vector data description. Machine Learning, 54(1), 2004.<br />
<br />
[6]: Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.<br />
<br />
[7]: Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.<br />
<br />
[8]: Warden, P. Speech commands: A dataset for limited vocabulary speech recognition, 2018. URL https: //arxiv.org/abs/1804.03209.<br />
<br />
== Critiques/Insights ==<br />
<br />
1. It would be interesting to see this implemented in self driving cars, for instance to detect unusual road conditions.<br />
<br />
2. Figure 1 shows a good representation on how this model works. However, how can we know that this model is not prone to overfitting? There are many situations where there are valid points that lie outside of the line, especially new data that the model has never see before. An explanation on how this is avoided would be good.<br />
<br />
3.In the introduction part, it should first explain what is "one class", and then make detailed application. Moreover, special definition words are used in many places in the text. No detailed explanation was given. In the end, the future application fields of DROCC and the research direction of the group can be explained.<br />
<br />
4. The geometry of this technique (classification based on incidence outside of a ball centered at a known point <math>x</math>) sounds quite similar to K-nearest neighbors. While the authors compared DROCC to single-nearest neighbor classification, choosing a higher K would result in a stronger, more regularized model. It would be interesting to see how DROCC compares to the general KNN classifier</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44557Adversarial Attacks on Copyright Detection Systems2020-11-15T15:55:42Z<p>T34sun: /* Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== Types of copyright detection systems ==<br />
Fingerprinting algorithm is to extract the features of the source file as a hash and then compare that to the materials protected by copyright in the database. If enough matches are found between the source and existing data, the copyright detection system is able to reject the copyright declaration of the source. Most audio, image and video fingerprinting algorithms work by training a neural network to output features or extracting hand-crafted features.<br />
<br />
In terms of video fingerprinting, a useful algorithm is to detect the entering/leaving time of the objects in the video (Saviaga & Toxtli, 2018). The final hash consists of the entering/leaving of different objects and a unique relationship of the objects. However, most of these video fingerprinting algorithms only train their neural networks by using simple distortions such as adding noise or flipping the video rather than adversarial perturbations. This leads to that these algorithms are strong against pre-defined distortions but not adversarial attacks.<br />
<br />
Moreover, some plagiarism detection systems also depend on neural networks to generate a fingerprint of the input document. Though using deep feature representations as a fingerprinting is efficient in detecting plagiarism, it still might be weak to adversarial attacks.<br />
<br />
Audio fingerprinting may perform better than the algorithms above since the most of time, the hash is generated by extracting hand-crafted features rather than training a neural network. But it still is easy to attack.<br />
<br />
== Case study: evading audio fingerprinting ==<br />
<br />
== 3.1 Audio Fingerprinting Model==<br />
The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.<br />
<br />
=== Interpreting the fingerprint extractor as a CNN ===<br />
The intention of this section is to build a differentiable neural network whose function resembling that of an audio fingerprinting algorithm, which is well-known for its ability to identify the meta-data, i.e. song names, artists and albums, while independent of audio format (Group et al., 2005). The generic neural network will then be used as an example of implementing black-box attacks on many popular real-world systems, in this case, YouTube and AudioTag. <br />
<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
$$ f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where k <math>{\in}</math> 0,1,...,N-1 (output channel index) and n <math>{\in}</math> 0,1,...,N-1 (index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
=== Formulating the adversarial loss function ===<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different (Hamming distance, 2020). For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 3.4 Remix adversarial examples==<br />
While solving the optimization problem, the resulted example would be able to fool the copyright detection system. But it could sound unnatural with the perturbations. Instead, the fingerprinting could be made in a more natural way (i.e., a different audio signal). By adding a term with the loss function, which switches the order of the max-pooling layers in the smooth maximum components in the loss function, a new optimization problem could be defined. This optimization problem is able to generate an adversarial example from the selected source, and also enforce the adversarial example to be similar to another signal. The resulting adversarial example is called Remix adversarial example because it gets the references to its source signal and another signal.<br />
<br />
== Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. <math>{l_{\infty}}</math> norm and <math>{l_{2}}</math> norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks are used to provide the baseline of adversarial examples’ effectiveness. Loss function <math>{J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|}</math> is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.<br />
<br />
[[File:Table_1_White-box.jpg |center ]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations. <br />
<br />
Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter <math>{\epsilon}</math> is required. YouTube Content ID system has a more robust fingerprint model.<br />
<br />
<br />
[[File:Table_2_Black-box.jpg |center]]<br />
<br />
<div align="center">Table 2: Norms of the perturbations for black-box attacks</div><br />
<br />
[[File:YouTube_Figure.jpg |center]]<br />
<br />
<div align="center">Figure 2: YouTube’s copyright detection recall against the magnitude of noise</div><br />
<br />
== Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==<br />
<br />
Group, P., Cano, P., Group, M., Group, E., Batlle, E., Ton Kalker Philips Research Laboratories Eindhoven, . . . Authors: Pedro Cano Music Technology Group. (2005, November 01). A Review of Audio Fingerprinting. Retrieved November 13, 2020, from https://dl.acm.org/doi/10.1007/s11265-005-4151-3<br />
<br />
Hamming distance. (2020, November 1). In ''Wikipedia''. https://en.wikipedia.org/wiki/Hamming_distance<br />
<br />
Jovanovic. (2015, February 2). ''How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing''. Toptal Engineering Blog. https://www.toptal.com/algorithms/shazam-it-music-processing-fingerprinting-and-recognition<br />
<br />
Saadatpanah, P., Shafahi, A., &amp; Goldstein, T. (2019, June 17). ''Adversarial attacks on copyright detection systems''. Retrieved November 13, 2020, from https://arxiv.org/abs/1906.07153.<br />
<br />
Saviaga, C. and Toxtli, C. ''Deepiracy: Video piracy detection system by using longest common subsequence and deep learning'', 2018. https://medium.com/hciwvu/piracy-detection-using-longestcommon-subsequence-and-neuralnetworks-a6f689a541a6<br />
<br />
Wang, A. et al. ''An industrial strength audio search algorithm''. In Ismir, volume 2003, pp. 7–13. Washington, DC, 2003.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44554Adversarial Attacks on Copyright Detection Systems2020-11-15T15:55:29Z<p>T34sun: /* 3.3 Formulating the adversarial loss function */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== Types of copyright detection systems ==<br />
Fingerprinting algorithm is to extract the features of the source file as a hash and then compare that to the materials protected by copyright in the database. If enough matches are found between the source and existing data, the copyright detection system is able to reject the copyright declaration of the source. Most audio, image and video fingerprinting algorithms work by training a neural network to output features or extracting hand-crafted features.<br />
<br />
In terms of video fingerprinting, a useful algorithm is to detect the entering/leaving time of the objects in the video (Saviaga & Toxtli, 2018). The final hash consists of the entering/leaving of different objects and a unique relationship of the objects. However, most of these video fingerprinting algorithms only train their neural networks by using simple distortions such as adding noise or flipping the video rather than adversarial perturbations. This leads to that these algorithms are strong against pre-defined distortions but not adversarial attacks.<br />
<br />
Moreover, some plagiarism detection systems also depend on neural networks to generate a fingerprint of the input document. Though using deep feature representations as a fingerprinting is efficient in detecting plagiarism, it still might be weak to adversarial attacks.<br />
<br />
Audio fingerprinting may perform better than the algorithms above since the most of time, the hash is generated by extracting hand-crafted features rather than training a neural network. But it still is easy to attack.<br />
<br />
== 3. Case study: evading audio fingerprinting ==<br />
<br />
== 3.1 Audio Fingerprinting Model==<br />
The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.<br />
<br />
== Interpreting the fingerprint extractor as a CNN ==<br />
The intention of this section is to build a differentiable neural network whose function resembling that of an audio fingerprinting algorithm, which is well-known for its ability to identify the meta-data, i.e. song names, artists and albums, while independent of audio format (Group et al., 2005). The generic neural network will then be used as an example of implementing black-box attacks on many popular real-world systems, in this case, YouTube and AudioTag. <br />
<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
$$ f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where k <math>{\in}</math> 0,1,...,N-1 (output channel index) and n <math>{\in}</math> 0,1,...,N-1 (index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
=== Formulating the adversarial loss function ===<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different (Hamming distance, 2020). For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 3.4 Remix adversarial examples==<br />
While solving the optimization problem, the resulted example would be able to fool the copyright detection system. But it could sound unnatural with the perturbations. Instead, the fingerprinting could be made in a more natural way (i.e., a different audio signal). By adding a term with the loss function, which switches the order of the max-pooling layers in the smooth maximum components in the loss function, a new optimization problem could be defined. This optimization problem is able to generate an adversarial example from the selected source, and also enforce the adversarial example to be similar to another signal. The resulting adversarial example is called Remix adversarial example because it gets the references to its source signal and another signal.<br />
<br />
== 4. Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. <math>{l_{\infty}}</math> norm and <math>{l_{2}}</math> norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks are used to provide the baseline of adversarial examples’ effectiveness. Loss function <math>{J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|}</math> is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.<br />
<br />
[[File:Table_1_White-box.jpg |center ]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations. <br />
<br />
Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter <math>{\epsilon}</math> is required. YouTube Content ID system has a more robust fingerprint model.<br />
<br />
<br />
[[File:Table_2_Black-box.jpg |center]]<br />
<br />
<div align="center">Table 2: Norms of the perturbations for black-box attacks</div><br />
<br />
[[File:YouTube_Figure.jpg |center]]<br />
<br />
<div align="center">Figure 2: YouTube’s copyright detection recall against the magnitude of noise</div><br />
<br />
== Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==<br />
<br />
Group, P., Cano, P., Group, M., Group, E., Batlle, E., Ton Kalker Philips Research Laboratories Eindhoven, . . . Authors: Pedro Cano Music Technology Group. (2005, November 01). A Review of Audio Fingerprinting. Retrieved November 13, 2020, from https://dl.acm.org/doi/10.1007/s11265-005-4151-3<br />
<br />
Hamming distance. (2020, November 1). In ''Wikipedia''. https://en.wikipedia.org/wiki/Hamming_distance<br />
<br />
Jovanovic. (2015, February 2). ''How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing''. Toptal Engineering Blog. https://www.toptal.com/algorithms/shazam-it-music-processing-fingerprinting-and-recognition<br />
<br />
Saadatpanah, P., Shafahi, A., &amp; Goldstein, T. (2019, June 17). ''Adversarial attacks on copyright detection systems''. Retrieved November 13, 2020, from https://arxiv.org/abs/1906.07153.<br />
<br />
Saviaga, C. and Toxtli, C. ''Deepiracy: Video piracy detection system by using longest common subsequence and deep learning'', 2018. https://medium.com/hciwvu/piracy-detection-using-longestcommon-subsequence-and-neuralnetworks-a6f689a541a6<br />
<br />
Wang, A. et al. ''An industrial strength audio search algorithm''. In Ismir, volume 2003, pp. 7–13. Washington, DC, 2003.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44552Adversarial Attacks on Copyright Detection Systems2020-11-15T15:55:08Z<p>T34sun: /* 5. Conclusion */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== 2. Types of copyright detection systems ==<br />
Fingerprinting algorithm is to extract the features of the source file as a hash and then compare that to the materials protected by copyright in the database. If enough matches are found between the source and existing data, the copyright detection system is able to reject the copyright declaration of the source. Most audio, image and video fingerprinting algorithms work by training a neural network to output features or extracting hand-crafted features.<br />
<br />
In terms of video fingerprinting, a useful algorithm is to detect the entering/leaving time of the objects in the video (Saviaga & Toxtli, 2018). The final hash consists of the entering/leaving of different objects and a unique relationship of the objects. However, most of these video fingerprinting algorithms only train their neural networks by using simple distortions such as adding noise or flipping the video rather than adversarial perturbations. This leads to that these algorithms are strong against pre-defined distortions but not adversarial attacks.<br />
<br />
Moreover, some plagiarism detection systems also depend on neural networks to generate a fingerprint of the input document. Though using deep feature representations as a fingerprinting is efficient in detecting plagiarism, it still might be weak to adversarial attacks.<br />
<br />
Audio fingerprinting may perform better than the algorithms above since the most of time, the hash is generated by extracting hand-crafted features rather than training a neural network. But it still is easy to attack.<br />
<br />
== 3. Case study: evading audio fingerprinting ==<br />
<br />
== 3.1 Audio Fingerprinting Model==<br />
The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.<br />
<br />
== Interpreting the fingerprint extractor as a CNN ==<br />
The intention of this section is to build a differentiable neural network whose function resembling that of an audio fingerprinting algorithm, which is well-known for its ability to identify the meta-data, i.e. song names, artists and albums, while independent of audio format (Group et al., 2005). The generic neural network will then be used as an example of implementing black-box attacks on many popular real-world systems, in this case, YouTube and AudioTag. <br />
<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
$$ f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where k <math>{\in}</math> 0,1,...,N-1 (output channel index) and n <math>{\in}</math> 0,1,...,N-1 (index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
=== 3.3 Formulating the adversarial loss function ===<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different (Hamming distance, 2020). For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 3.4 Remix adversarial examples==<br />
While solving the optimization problem, the resulted example would be able to fool the copyright detection system. But it could sound unnatural with the perturbations. Instead, the fingerprinting could be made in a more natural way (i.e., a different audio signal). By adding a term with the loss function, which switches the order of the max-pooling layers in the smooth maximum components in the loss function, a new optimization problem could be defined. This optimization problem is able to generate an adversarial example from the selected source, and also enforce the adversarial example to be similar to another signal. The resulting adversarial example is called Remix adversarial example because it gets the references to its source signal and another signal.<br />
<br />
== 4. Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. <math>{l_{\infty}}</math> norm and <math>{l_{2}}</math> norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks are used to provide the baseline of adversarial examples’ effectiveness. Loss function <math>{J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|}</math> is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.<br />
<br />
[[File:Table_1_White-box.jpg |center ]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations. <br />
<br />
Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter <math>{\epsilon}</math> is required. YouTube Content ID system has a more robust fingerprint model.<br />
<br />
<br />
[[File:Table_2_Black-box.jpg |center]]<br />
<br />
<div align="center">Table 2: Norms of the perturbations for black-box attacks</div><br />
<br />
[[File:YouTube_Figure.jpg |center]]<br />
<br />
<div align="center">Figure 2: YouTube’s copyright detection recall against the magnitude of noise</div><br />
<br />
== Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==<br />
<br />
Group, P., Cano, P., Group, M., Group, E., Batlle, E., Ton Kalker Philips Research Laboratories Eindhoven, . . . Authors: Pedro Cano Music Technology Group. (2005, November 01). A Review of Audio Fingerprinting. Retrieved November 13, 2020, from https://dl.acm.org/doi/10.1007/s11265-005-4151-3<br />
<br />
Hamming distance. (2020, November 1). In ''Wikipedia''. https://en.wikipedia.org/wiki/Hamming_distance<br />
<br />
Jovanovic. (2015, February 2). ''How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing''. Toptal Engineering Blog. https://www.toptal.com/algorithms/shazam-it-music-processing-fingerprinting-and-recognition<br />
<br />
Saadatpanah, P., Shafahi, A., &amp; Goldstein, T. (2019, June 17). ''Adversarial attacks on copyright detection systems''. Retrieved November 13, 2020, from https://arxiv.org/abs/1906.07153.<br />
<br />
Saviaga, C. and Toxtli, C. ''Deepiracy: Video piracy detection system by using longest common subsequence and deep learning'', 2018. https://medium.com/hciwvu/piracy-detection-using-longestcommon-subsequence-and-neuralnetworks-a6f689a541a6<br />
<br />
Wang, A. et al. ''An industrial strength audio search algorithm''. In Ismir, volume 2003, pp. 7–13. Washington, DC, 2003.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44549Adversarial Attacks on Copyright Detection Systems2020-11-15T15:54:56Z<p>T34sun: /* =. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== 2. Types of copyright detection systems ==<br />
Fingerprinting algorithm is to extract the features of the source file as a hash and then compare that to the materials protected by copyright in the database. If enough matches are found between the source and existing data, the copyright detection system is able to reject the copyright declaration of the source. Most audio, image and video fingerprinting algorithms work by training a neural network to output features or extracting hand-crafted features.<br />
<br />
In terms of video fingerprinting, a useful algorithm is to detect the entering/leaving time of the objects in the video (Saviaga & Toxtli, 2018). The final hash consists of the entering/leaving of different objects and a unique relationship of the objects. However, most of these video fingerprinting algorithms only train their neural networks by using simple distortions such as adding noise or flipping the video rather than adversarial perturbations. This leads to that these algorithms are strong against pre-defined distortions but not adversarial attacks.<br />
<br />
Moreover, some plagiarism detection systems also depend on neural networks to generate a fingerprint of the input document. Though using deep feature representations as a fingerprinting is efficient in detecting plagiarism, it still might be weak to adversarial attacks.<br />
<br />
Audio fingerprinting may perform better than the algorithms above since the most of time, the hash is generated by extracting hand-crafted features rather than training a neural network. But it still is easy to attack.<br />
<br />
== 3. Case study: evading audio fingerprinting ==<br />
<br />
== 3.1 Audio Fingerprinting Model==<br />
The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.<br />
<br />
== Interpreting the fingerprint extractor as a CNN ==<br />
The intention of this section is to build a differentiable neural network whose function resembling that of an audio fingerprinting algorithm, which is well-known for its ability to identify the meta-data, i.e. song names, artists and albums, while independent of audio format (Group et al., 2005). The generic neural network will then be used as an example of implementing black-box attacks on many popular real-world systems, in this case, YouTube and AudioTag. <br />
<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
$$ f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where k <math>{\in}</math> 0,1,...,N-1 (output channel index) and n <math>{\in}</math> 0,1,...,N-1 (index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different (Hamming distance, 2020). For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 3.4 Remix adversarial examples==<br />
While solving the optimization problem, the resulted example would be able to fool the copyright detection system. But it could sound unnatural with the perturbations. Instead, the fingerprinting could be made in a more natural way (i.e., a different audio signal). By adding a term with the loss function, which switches the order of the max-pooling layers in the smooth maximum components in the loss function, a new optimization problem could be defined. This optimization problem is able to generate an adversarial example from the selected source, and also enforce the adversarial example to be similar to another signal. The resulting adversarial example is called Remix adversarial example because it gets the references to its source signal and another signal.<br />
<br />
== 4. Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. <math>{l_{\infty}}</math> norm and <math>{l_{2}}</math> norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks are used to provide the baseline of adversarial examples’ effectiveness. Loss function <math>{J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|}</math> is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.<br />
<br />
[[File:Table_1_White-box.jpg |center ]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations. <br />
<br />
Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter <math>{\epsilon}</math> is required. YouTube Content ID system has a more robust fingerprint model.<br />
<br />
<br />
[[File:Table_2_Black-box.jpg |center]]<br />
<br />
<div align="center">Table 2: Norms of the perturbations for black-box attacks</div><br />
<br />
[[File:YouTube_Figure.jpg |center]]<br />
<br />
<div align="center">Figure 2: YouTube’s copyright detection recall against the magnitude of noise</div><br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==<br />
<br />
Group, P., Cano, P., Group, M., Group, E., Batlle, E., Ton Kalker Philips Research Laboratories Eindhoven, . . . Authors: Pedro Cano Music Technology Group. (2005, November 01). A Review of Audio Fingerprinting. Retrieved November 13, 2020, from https://dl.acm.org/doi/10.1007/s11265-005-4151-3<br />
<br />
Hamming distance. (2020, November 1). In ''Wikipedia''. https://en.wikipedia.org/wiki/Hamming_distance<br />
<br />
Jovanovic. (2015, February 2). ''How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing''. Toptal Engineering Blog. https://www.toptal.com/algorithms/shazam-it-music-processing-fingerprinting-and-recognition<br />
<br />
Saadatpanah, P., Shafahi, A., &amp; Goldstein, T. (2019, June 17). ''Adversarial attacks on copyright detection systems''. Retrieved November 13, 2020, from https://arxiv.org/abs/1906.07153.<br />
<br />
Saviaga, C. and Toxtli, C. ''Deepiracy: Video piracy detection system by using longest common subsequence and deep learning'', 2018. https://medium.com/hciwvu/piracy-detection-using-longestcommon-subsequence-and-neuralnetworks-a6f689a541a6<br />
<br />
Wang, A. et al. ''An industrial strength audio search algorithm''. In Ismir, volume 2003, pp. 7–13. Washington, DC, 2003.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44547Adversarial Attacks on Copyright Detection Systems2020-11-15T15:54:31Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== 2. Types of copyright detection systems ==<br />
Fingerprinting algorithm is to extract the features of the source file as a hash and then compare that to the materials protected by copyright in the database. If enough matches are found between the source and existing data, the copyright detection system is able to reject the copyright declaration of the source. Most audio, image and video fingerprinting algorithms work by training a neural network to output features or extracting hand-crafted features.<br />
<br />
In terms of video fingerprinting, a useful algorithm is to detect the entering/leaving time of the objects in the video (Saviaga & Toxtli, 2018). The final hash consists of the entering/leaving of different objects and a unique relationship of the objects. However, most of these video fingerprinting algorithms only train their neural networks by using simple distortions such as adding noise or flipping the video rather than adversarial perturbations. This leads to that these algorithms are strong against pre-defined distortions but not adversarial attacks.<br />
<br />
Moreover, some plagiarism detection systems also depend on neural networks to generate a fingerprint of the input document. Though using deep feature representations as a fingerprinting is efficient in detecting plagiarism, it still might be weak to adversarial attacks.<br />
<br />
Audio fingerprinting may perform better than the algorithms above since the most of time, the hash is generated by extracting hand-crafted features rather than training a neural network. But it still is easy to attack.<br />
<br />
== 3. Case study: evading audio fingerprinting ==<br />
<br />
== 3.1 Audio Fingerprinting Model==<br />
The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.<br />
<br />
===. Interpreting the fingerprint extractor as a CNN ==<br />
The intention of this section is to build a differentiable neural network whose function resembling that of an audio fingerprinting algorithm, which is well-known for its ability to identify the meta-data, i.e. song names, artists and albums, while independent of audio format (Group et al., 2005). The generic neural network will then be used as an example of implementing black-box attacks on many popular real-world systems, in this case, YouTube and AudioTag. <br />
<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
$$ f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where k <math>{\in}</math> 0,1,...,N-1 (output channel index) and n <math>{\in}</math> 0,1,...,N-1 (index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different (Hamming distance, 2020). For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 3.4 Remix adversarial examples==<br />
While solving the optimization problem, the resulted example would be able to fool the copyright detection system. But it could sound unnatural with the perturbations. Instead, the fingerprinting could be made in a more natural way (i.e., a different audio signal). By adding a term with the loss function, which switches the order of the max-pooling layers in the smooth maximum components in the loss function, a new optimization problem could be defined. This optimization problem is able to generate an adversarial example from the selected source, and also enforce the adversarial example to be similar to another signal. The resulting adversarial example is called Remix adversarial example because it gets the references to its source signal and another signal.<br />
<br />
== 4. Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. <math>{l_{\infty}}</math> norm and <math>{l_{2}}</math> norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks are used to provide the baseline of adversarial examples’ effectiveness. Loss function <math>{J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|}</math> is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.<br />
<br />
[[File:Table_1_White-box.jpg |center ]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations. <br />
<br />
Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter <math>{\epsilon}</math> is required. YouTube Content ID system has a more robust fingerprint model.<br />
<br />
<br />
[[File:Table_2_Black-box.jpg |center]]<br />
<br />
<div align="center">Table 2: Norms of the perturbations for black-box attacks</div><br />
<br />
[[File:YouTube_Figure.jpg |center]]<br />
<br />
<div align="center">Figure 2: YouTube’s copyright detection recall against the magnitude of noise</div><br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==<br />
<br />
Group, P., Cano, P., Group, M., Group, E., Batlle, E., Ton Kalker Philips Research Laboratories Eindhoven, . . . Authors: Pedro Cano Music Technology Group. (2005, November 01). A Review of Audio Fingerprinting. Retrieved November 13, 2020, from https://dl.acm.org/doi/10.1007/s11265-005-4151-3<br />
<br />
Hamming distance. (2020, November 1). In ''Wikipedia''. https://en.wikipedia.org/wiki/Hamming_distance<br />
<br />
Jovanovic. (2015, February 2). ''How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing''. Toptal Engineering Blog. https://www.toptal.com/algorithms/shazam-it-music-processing-fingerprinting-and-recognition<br />
<br />
Saadatpanah, P., Shafahi, A., &amp; Goldstein, T. (2019, June 17). ''Adversarial attacks on copyright detection systems''. Retrieved November 13, 2020, from https://arxiv.org/abs/1906.07153.<br />
<br />
Saviaga, C. and Toxtli, C. ''Deepiracy: Video piracy detection system by using longest common subsequence and deep learning'', 2018. https://medium.com/hciwvu/piracy-detection-using-longestcommon-subsequence-and-neuralnetworks-a6f689a541a6<br />
<br />
Wang, A. et al. ''An industrial strength audio search algorithm''. In Ismir, volume 2003, pp. 7–13. Washington, DC, 2003.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44539Adversarial Attacks on Copyright Detection Systems2020-11-15T15:43:48Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== 2. Types of copyright detection systems ==<br />
Fingerprinting algorithm is to extract the features of the source file as a hash and then compare that to the materials protected by copyright in the database. If enough matches are found between the source and existing data, the copyright detection system is able to reject the copyright declaration of the source. Most audio, image and video fingerprinting algorithms work by training a neural network to output features or extracting hand-crafted features.<br />
<br />
In terms of video fingerprinting, a useful algorithm is to detect the entering/leaving time of the objects in the video (Saviaga & Toxtli, 2018). The final hash consists of the entering/leaving of different objects and a unique relationship of the objects. However, most of these video fingerprinting algorithms only train their neural networks by using simple distortions such as adding noise or flipping the video rather than adversarial perturbations. This leads to that these algorithms are strong against pre-defined distortions but not adversarial attacks.<br />
<br />
Moreover, some plagiarism detection systems also depend on neural networks to generate a fingerprint of the input document. Though using deep feature representations as a fingerprinting is efficient in detecting plagiarism, it still might be weak to adversarial attacks.<br />
<br />
Audio fingerprinting may perform better than the algorithms above since the most of time, the hash is generated by extracting hand-crafted features rather than training a neural network. But it still is easy to attack.<br />
<br />
== 3. Case study: evading audio fingerprinting ==<br />
<br />
== 3.1 Audio Fingerprinting Model==<br />
The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The intention of this section is to build a differentiable neural network whose function resembling that of an audio fingerprinting algorithm, which is well-known for its ability to identify the meta-data, i.e. song names, artists and albums, while independent of audio format (Group et al., 2005). The generic neural network will then be used as an example of implementing black-box attacks on many popular real-world systems, in this case, YouTube and AudioTag. <br />
<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
$$ f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where k <math>{\in}</math> 0,1,...,N-1 (output channel index) and n <math>{\in}</math> 0,1,...,N-1 (index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different (Hamming distance, 2020). For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 3.4 Remix adversarial examples==<br />
While solving the optimization problem, the resulted example would be able to fool the copyright detection system. But it could sound unnatural with the perturbations. Instead, the fingerprinting could be made in a more natural way (i.e., a different audio signal). By adding a term with the loss function, which switches the order of the max-pooling layers in the smooth maximum components in the loss function, a new optimization problem could be defined. This optimization problem is able to generate an adversarial example from the selected source, and also enforce the adversarial example to be similar to another signal. The resulting adversarial example is called Remix adversarial example because it gets the references to its source signal and another signal.<br />
<br />
== 4. Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. <math>{l_{\infty}}</math> norm and <math>{l_{2}}</math> norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks are used to provide the baseline of adversarial examples’ effectiveness. Loss function <math>{J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|}</math> is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.<br />
<br />
[[File:Table_1_White-box.jpg |center ]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations. <br />
<br />
Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter <math>{\epsilon}</math> is required. YouTube Content ID system has a more robust fingerprint model.<br />
<br />
<br />
[[File:Table_2_Black-box.jpg |center]]<br />
<br />
<div align="center">Table 2: Norms of the perturbations for black-box attacks</div><br />
<br />
[[File:YouTube_Figure.jpg |center]]<br />
<br />
<div align="center">Figure 2: YouTube’s copyright detection recall against the magnitude of noise</div><br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==<br />
<br />
Group, P., Cano, P., Group, M., Group, E., Batlle, E., Ton Kalker Philips Research Laboratories Eindhoven, . . . Authors: Pedro Cano Music Technology Group. (2005, November 01). A Review of Audio Fingerprinting. Retrieved November 13, 2020, from https://dl.acm.org/doi/10.1007/s11265-005-4151-3<br />
<br />
Hamming distance. (2020, November 1). In ''Wikipedia''. https://en.wikipedia.org/wiki/Hamming_distance<br />
<br />
Jovanovic. (2015, February 2). ''How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing''. Toptal Engineering Blog. https://www.toptal.com/algorithms/shazam-it-music-processing-fingerprinting-and-recognition<br />
<br />
Saadatpanah, P., Shafahi, A., &amp; Goldstein, T. (2019, June 17). ''Adversarial attacks on copyright detection systems''. Retrieved November 13, 2020, from https://arxiv.org/abs/1906.07153.<br />
<br />
Wang, A. et al. ''An industrial strength audio search algorithm''. In Ismir, volume 2003, pp. 7–13. Washington, DC, 2003.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44536Adversarial Attacks on Copyright Detection Systems2020-11-15T15:41:19Z<p>T34sun: /* References */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== 2. Types of copyright detection systems ==<br />
Fingerprinting algorithm is to extract the features of the source file as a hash and then compare that to the materials protected by copyright in the database. If enough matches are found between the source and existing data, the copyright detection system is able to reject the copyright declaration of the source. Most audio, image and video fingerprinting algorithms work by training a neural network to output features or extracting hand-crafted features.<br />
<br />
In terms of video fingerprinting, a useful algorithm is to detect the entering/leaving time of the objects in the video (Saviaga & Toxtli, 2018). The final hash consists of the entering/leaving of different objects and a unique relationship of the objects. However, most of these video fingerprinting algorithms only train their neural networks by using simple distortions such as adding noise or flipping the video rather than adversarial perturbations. This leads to that these algorithms are strong against pre-defined distortions but not adversarial attacks.<br />
<br />
Moreover, some plagiarism detection systems also depend on neural networks to generate a fingerprint of the input document. Though using deep feature representations as a fingerprinting is efficient in detecting plagiarism, it still might be weak to adversarial attacks.<br />
<br />
Audio fingerprinting may perform better than the algorithms above since the most of time, the hash is generated by extracting hand-crafted features rather than training a neural network. But it still is easy to attack.<br />
<br />
== 3. Case study: evading audio fingerprinting ==<br />
<br />
== 3.1 Audio Fingerprinting Model==<br />
The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
$$ f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where k <math>{\in}</math> 0,1,...,N-1 (output channel index) and n <math>{\in}</math> 0,1,...,N-1 (index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different (Hamming distance, 2020). For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 3.4 Remix adversarial examples==<br />
<br />
== 4. Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. <math>{l_{\infty}}</math> norm and <math>{l_{2}}</math> norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks are used to provide the baseline of adversarial examples’ effectiveness. Loss function <math>{J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|}</math> is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.<br />
<br />
[[File:Table_1_White-box.jpg |center ]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations. <br />
<br />
Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter <math>{\epsilon}</math> is required. YouTube Content ID system has a more robust fingerprint model.<br />
<br />
<br />
[[File:Table_2_Black-box.jpg |center]]<br />
<br />
<div align="center">Table 2: Norms of the perturbations for black-box attacks</div><br />
<br />
[[File:YouTube_Figure.jpg |center]]<br />
<br />
<div align="center">Figure 2: YouTube’s copyright detection recall against the magnitude of noise</div><br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==<br />
<br />
Group, P., Cano, P., Group, M., Group, E., Batlle, E., Ton Kalker Philips Research Laboratories Eindhoven, . . . Authors: Pedro Cano Music Technology Group. (2005, November 01). A Review of Audio Fingerprinting. Retrieved November 13, 2020, from https://dl.acm.org/doi/10.1007/s11265-005-4151-3<br />
<br />
Hamming distance. (2020, November 1). In ''Wikipedia''. https://en.wikipedia.org/wiki/Hamming_distance<br />
<br />
Jovanovic. (2015, February 2). ''How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing''. Toptal Engineering Blog. https://www.toptal.com/algorithms/shazam-it-music-processing-fingerprinting-and-recognition<br />
<br />
Saadatpanah, P., Shafahi, A., &amp; Goldstein, T. (2019, June 17). ''Adversarial attacks on copyright detection systems''. Retrieved November 13, 2020, from https://arxiv.org/abs/1906.07153.<br />
<br />
Wang, A. et al. ''An industrial strength audio search algorithm''. In Ismir, volume 2003, pp. 7–13. Washington, DC, 2003.</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44506Adversarial Attacks on Copyright Detection Systems2020-11-15T15:19:10Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== 3. Case study: evading audio fingerprinting ==<br />
<br />
== 3.1 Audio Fingerprinting Model==<br />
The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
$$ f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where k <math>{\in}</math> 0,1,...,N-1 (output channel index) and n <math>{\in}</math> 0,1,...,N-1 (index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 4. Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. L_infinity norm and L2 norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks are used to provide the baseline of adversarial examples’ effectiveness. Loss function (3) is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.<br />
<br />
[[File:Table_1_White-box.jpg | center]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations. <br />
<br />
Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter is required. YouTube Content ID system has a more robust fingerprint model.<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44505Adversarial Attacks on Copyright Detection Systems2020-11-15T15:18:49Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== 3. Case study: evading audio fingerprinting ==<br />
<br />
== 3.1 Audio Fingerprinting Model==<br />
The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
$$ f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where k <math>{\in}</math> 0,1,...,N-1 (output channel index) and n <math>(\in)</math> 0,1,...,N-1 (index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 4. Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. L_infinity norm and L2 norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks are used to provide the baseline of adversarial examples’ effectiveness. Loss function (3) is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.<br />
<br />
[[File:Table_1_White-box.jpg | center]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations. <br />
<br />
Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter is required. YouTube Content ID system has a more robust fingerprint model.<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44504Adversarial Attacks on Copyright Detection Systems2020-11-15T15:18:15Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== 3. Case study: evading audio fingerprinting ==<br />
<br />
== 3.1 Audio Fingerprinting Model==<br />
The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
$$ f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where k <math>{\in}</math> 0,1,...,N-1 (output channel index) and n <math>{\in)</math> 0,1,...,N-1(index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 4. Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. L_infinity norm and L2 norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks are used to provide the baseline of adversarial examples’ effectiveness. Loss function (3) is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.<br />
<br />
[[File:Table_1_White-box.jpg | center]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations. <br />
<br />
Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter is required. YouTube Content ID system has a more robust fingerprint model.<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44500Adversarial Attacks on Copyright Detection Systems2020-11-15T15:15:10Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== 3. Case study: evading audio fingerprinting ==<br />
<br />
== 3.1 Audio Fingerprinting Model==<br />
The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
$$ f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})} $$ <br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
$$f_{2}(k,n)=e^{-i 2 \pi k n / N} $$<br />
where k $\in$ 0,1,...,N-1 (output channel index) and n$\in$ 0,1,...,N-1(index of filter coefficient)<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 4. Evaluating transfer attacks on industrial systems==<br />
The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. L_infinity norm and L2 norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.<br />
<br />
Before evaluating black-box attacks against real-world systems, white-box attacks are used to provide the baseline of adversarial examples’ effectiveness. Loss function (3) is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.<br />
<br />
[[File:Table_1_White-box.jpg | center]]<br />
<br />
<div align="center">Table 1: Norms of the perturbations for white-box attacks</div><br />
<br />
In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations. <br />
<br />
Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter is required. YouTube Content ID system has a more robust fingerprint model.<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44491Adversarial Attacks on Copyright Detection Systems2020-11-15T15:10:01Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== 3. Case study: evading audio fingerprinting ==<br />
<br />
== 3.1 Audio Fingerprinting Model==<br />
The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
$$ f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})} $$ <br />
[[File:Layer_1.png | thumb | center | 500px ]]<br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
[[File:Layer_2.png | thumb | center | 500px ]]<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
<br />
Lastly, instead of the maximum operator, smoothed max function is used here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
where <math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44479Adversarial Attacks on Copyright Detection Systems2020-11-15T15:04:55Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
[[File:Layer_1.png | thumb | center | 500px ]]<br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
[[File:Layer_2.png | thumb | center | 500px ]]<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
Lastly, instead of the maximum operator, smoothed max function is replaced here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
<math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44476Adversarial Attacks on Copyright Detection Systems2020-11-15T15:03:49Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel. <br />
<br />
[[File:Layer_1.png | thumb | center | 500px ]]<br />
<br />
The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.<br />
The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below. <br />
<br />
[[File:Layer_2.png | thumb | center | 500px ]]<br />
<br />
The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. <br />
However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where <math>{||\delta||_p\le\epsilon}</math> is the <math>{l_p}</math>-norm of the perturbation and <math>{\epsilon}</math> is the bound of the difference between the original file and the adversarial example. <br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let <math>{\psi(x)}</math> and <math>{\psi(y)}</math> be two binary fingerprints outputted from the model, the number of peaks shared by <math>{x}</math> and <math>{y}</math> can be found through <math>{|\psi(x)\cdot\psi(y)|}</math>. Now, to get a differentiable loss function, the equation is found to be <br />
<br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in <math>{\psi(x)}</math> outside of neighborhood of the peaks of <math>{\psi(y)}</math>. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width <math>{w_1}</math> while the other one has a smaller width <math>{w_2}</math>. For any location, if the output of <math>{w_1}</math> pooling is strictly greater than the output of <math>{w_2}</math> pooling, then it can be concluded that no peak in that location with radius <math>{w_2}</math>. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of <math>{x}</math> which are in neighborhood of peaks of <math>{y}</math> with radius of <math>{w_2}</math>. The activation function uses <math>{ReLU}</math>. <math>{c}</math> is the difference between the output of two max-pooling layers. <br />
<br />
Lastly, instead of the maximum operator, smoothed max function is replaced here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
<math>{\alpha}</math> is a smoothing hyper parameter. When <math>{\alpha}</math> approaches positive infinity, <math>{S_\alpha}</math> is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where <math>{x}</math> is the input signal, <math>{J}</math> is the loss function with the smoothed max function.<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Layer_2.png&diff=44473File:Layer 2.png2020-11-15T15:01:54Z<p>T34sun: </p>
<hr />
<div></div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Layer_1.png&diff=44472File:Layer 1.png2020-11-15T15:01:24Z<p>T34sun: </p>
<hr />
<div></div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44471Adversarial Attacks on Copyright Detection Systems2020-11-15T15:00:39Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center | 500px ]]<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where $||\delta||_p\le\epsilon$ is the $l_p$-norm of the perturbation and $\epsilon$ is the bound of the difference between the original file and the adversarial example. <br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let $\psi(x)$ and $\psi(y)$ be two binary fingerprints outputted from the model, the number of peaks shared by $x$ and $y$ can be found through $|\psi(x)\cdot\psi(y)|$. Now, to get a differentiable loss function, the equation is found to be <br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in $\phi(x)$ outside of neighborhood of the peaks of $\phi(y)$. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width $w_1$ while the other one has a smaller width $w_2$. For any location, if the output of $w_1$ pooling is strictly greater than the output of $w_2$ pooling, then it can be concluded that no peak in that location with radius $w_2$. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of $x$ which are in neighborhood of peaks of $y$ with radius of $w_2$. The activation function uses $ReLU$. $c$ is the difference between the output of two max-pooling layers. <br />
<br />
Lastly, instead of the maximum operator, smoothed max function is replaced here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
$\alpha$ is a smoothing hyper parameter. When $\alpha$ approaches positive infinity, $S_\alpha$ is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where $x$ is the input signal, $J$ is the loss function with the smoothed max function.<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44470Adversarial Attacks on Copyright Detection Systems2020-11-15T14:58:39Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov network.png | thumb | center ]]<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation <math>{\delta}</math>, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where $||\delta||_p\le\epsilon$ is the $l_p$-norm of the perturbation and $\epsilon$ is the bound of the difference between the original file and the adversarial example. <br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let $\psi(x)$ and $\psi(y)$ be two binary fingerprints outputted from the model, the number of peaks shared by $x$ and $y$ can be found through $|\psi(x)\cdot\psi(y)|$. Now, to get a differentiable loss function, the equation is found to be <br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in $\phi(x)$ outside of neighborhood of the peaks of $\phi(y)$. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width $w_1$ while the other one has a smaller width $w_2$. For any location, if the output of $w_1$ pooling is strictly greater than the output of $w_2$ pooling, then it can be concluded that no peak in that location with radius $w_2$. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of $x$ which are in neighborhood of peaks of $y$ with radius of $w_2$. The activation function uses $ReLU$. $c$ is the difference between the output of two max-pooling layers. <br />
<br />
Lastly, instead of the maximum operator, smoothed max function is replaced here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
$\alpha$ is a smoothing hyper parameter. When $\alpha$ approaches positive infinity, $S_\alpha$ is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where $x$ is the input signal, $J$ is the loss function with the smoothed max function.<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44467Adversarial Attacks on Copyright Detection Systems2020-11-15T14:57:01Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:cov_network.jpg | thumb | center | 1000px]]<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation $\delta$, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where $||\delta||_p\le\epsilon$ is the $l_p$-norm of the perturbation and $\epsilon$ is the bound of the difference between the original file and the adversarial example. <br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let $\psi(x)$ and $\psi(y)$ be two binary fingerprints outputted from the model, the number of peaks shared by $x$ and $y$ can be found through $|\psi(x)\cdot\psi(y)|$. Now, to get a differentiable loss function, the equation is found to be <br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in $\phi(x)$ outside of neighborhood of the peaks of $\phi(y)$. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width $w_1$ while the other one has a smaller width $w_2$. For any location, if the output of $w_1$ pooling is strictly greater than the output of $w_2$ pooling, then it can be concluded that no peak in that location with radius $w_2$. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of $x$ which are in neighborhood of peaks of $y$ with radius of $w_2$. The activation function uses $ReLU$. $c$ is the difference between the output of two max-pooling layers. <br />
<br />
Lastly, instead of the maximum operator, smoothed max function is replaced here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
$\alpha$ is a smoothing hyper parameter. When $\alpha$ approaches positive infinity, $S_\alpha$ is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where $x$ is the input signal, $J$ is the loss function with the smoothed max function. <br />
<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:cov_network.png&diff=44466File:cov network.png2020-11-15T14:56:10Z<p>T34sun: </p>
<hr />
<div></div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44465Adversarial Attacks on Copyright Detection Systems2020-11-15T14:55:16Z<p>T34sun: /* 3.2. Interpreting the fingerprint extractor as a CNN */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
<br />
[[File:1.jpg | thumb | center | 1000px]]<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation $\delta$, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where $||\delta||_p\le\epsilon$ is the $l_p$-norm of the perturbation and $\epsilon$ is the bound of the difference between the original file and the adversarial example. <br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let $\psi(x)$ and $\psi(y)$ be two binary fingerprints outputted from the model, the number of peaks shared by $x$ and $y$ can be found through $|\psi(x)\cdot\psi(y)|$. Now, to get a differentiable loss function, the equation is found to be <br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in $\phi(x)$ outside of neighborhood of the peaks of $\phi(y)$. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width $w_1$ while the other one has a smaller width $w_2$. For any location, if the output of $w_1$ pooling is strictly greater than the output of $w_2$ pooling, then it can be concluded that no peak in that location with radius $w_2$. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of $x$ which are in neighborhood of peaks of $y$ with radius of $w_2$. The activation function uses $ReLU$. $c$ is the difference between the output of two max-pooling layers. <br />
<br />
Lastly, instead of the maximum operator, smoothed max function is replaced here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
$\alpha$ is a smoothing hyper parameter. When $\alpha$ approaches positive infinity, $S_\alpha$ is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where $x$ is the input signal, $J$ is the loss function with the smoothed max function. <br />
<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44464Adversarial Attacks on Copyright Detection Systems2020-11-15T14:52:52Z<p>T34sun: </p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
==3.2. Interpreting the fingerprint extractor as a CNN ==<br />
The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.<br />
[[File:Screen Shot 2020-11-14 at 9.48.11 PM.jpg]]<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation $\delta$, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where $||\delta||_p\le\epsilon$ is the $l_p$-norm of the perturbation and $\epsilon$ is the bound of the difference between the original file and the adversarial example. <br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let $\psi(x)$ and $\psi(y)$ be two binary fingerprints outputted from the model, the number of peaks shared by $x$ and $y$ can be found through $|\psi(x)\cdot\psi(y)|$. Now, to get a differentiable loss function, the equation is found to be <br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in $\phi(x)$ outside of neighborhood of the peaks of $\phi(y)$. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width $w_1$ while the other one has a smaller width $w_2$. For any location, if the output of $w_1$ pooling is strictly greater than the output of $w_2$ pooling, then it can be concluded that no peak in that location with radius $w_2$. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of $x$ which are in neighborhood of peaks of $y$ with radius of $w_2$. The activation function uses $ReLU$. $c$ is the difference between the output of two max-pooling layers. <br />
<br />
Lastly, instead of the maximum operator, smoothed max function is replaced here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
$\alpha$ is a smoothing hyper parameter. When $\alpha$ approaches positive infinity, $S_\alpha$ is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where $x$ is the input signal, $J$ is the loss function with the smoothed max function. <br />
<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44463Adversarial Attacks on Copyright Detection Systems2020-11-15T14:51:01Z<p>T34sun: </p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
== 3.3 Formulating the adversarial loss function ==<br />
<br />
In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation $\delta$, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. <br />
$$\text{bound:}\ ||\delta||_p\le\epsilon$$<br />
<br />
where $||\delta||_p\le\epsilon$ is the $l_p$-norm of the perturbation and $\epsilon$ is the bound of the difference between the original file and the adversarial example. <br />
<br />
To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2. <br />
<br />
Let $\psi(x)$ and $\psi(y)$ be two binary fingerprints outputted from the model, the number of peaks shared by $x$ and $y$ can be found through $|\psi(x)\cdot\psi(y)|$. Now, to get a differentiable loss function, the equation is found to be <br />
$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$<br />
<br />
This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in $\phi(x)$ outside of neighborhood of the peaks of $\phi(y)$. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width $w_1$ while the other one has a smaller width $w_2$. For any location, if the output of $w_1$ pooling is strictly greater than the output of $w_2$ pooling, then it can be concluded that no peak in that location with radius $w_2$. <br />
<br />
The loss function is as the following:<br />
<br />
$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$<br />
The equation above penalizes the peaks of $x$ which are in neighborhood of peaks of $y$ with radius of $w_2$. The activation function uses $ReLU$. $c$ is the difference between the output of two max-pooling layers. <br />
<br />
Lastly, instead of the maximum operator, smoothed max function is replaced here:<br />
$$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$<br />
$\alpha$ is a smoothing hyper parameter. When $\alpha$ approaches positive infinity, $S_\alpha$ is closer to the actual max function. <br />
<br />
To summarize, the optimization problem can be formulated as the following:<br />
<br />
$$<br />
\underset{\delta}{\min}J(x+\delta,x)\\<br />
s.t.||\delta||_{\infty}\le\epsilon<br />
$$<br />
where $x$ is the input signal, $J$ is the loss function with the smoothed max function. <br />
<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44459Adversarial Attacks on Copyright Detection Systems2020-11-15T14:46:46Z<p>T34sun: </p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
==1. Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
== Conclusion ==<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== 5. Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sunhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems&diff=44458Adversarial Attacks on Copyright Detection Systems2020-11-15T14:44:44Z<p>T34sun: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun<br />
<br />
== Introduction ==<br />
Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.<br />
<br />
1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.<br />
<br />
2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.<br />
<br />
3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.<br />
<br />
<br />
In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.<br />
<br />
== Conclusion ==<br />
In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.<br />
<br />
== References ==</div>T34sun