## Presented by

Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun

## 1. Introduction

Copyright detection system is one of the most commonly used machine learning systems; however, the hardiness of copyright detection and content control systems to adversarial attacks, inputs intentionally designed by people to cause the model to make a mistake, has not been widely addressed by public. Copyright detection system are vulnerable to attacks for three reasons.

1. Unlike to physical-world attacks where adversarial samples need to survive under different conditions like resolutions and viewing angles, any digital files can be uploaded directly to the web without going through a camera or microphone.

2. The detection system is open which means the uploaded files may not correspond to an existing class. In this case, it will prevent people from uploading unprotected audio/video whereas most of the uploaded files nowadays are not protected.

3. The detection system needs to handle a vast majority of content which have different labels but similar features. For example, in the ImageNet classification task, the system is easily attacked when there are two cats/dogs/birds with high similarities but from different classes.

In this paper, different types of copyright detection systems will be introduced. A widely used detection model from Shazam, a popular app used for recognizing music, will be discussed. Next, the paper talks about how to generate audio fingerprints using convolutional neural network and formulates the adversarial loss function using standard gradient methods. An example of remixing music is given to show how adversarial examples can be created. Then the adversarial attacks are applied onto industrial systems like AudioTag and YouTube Content ID to evaluate the effectiveness of the systems, and the conclusion is made at the end.

## 3.1 Audio Fingerprinting Model

The audio fingerprinting model plays an important role in copyright detection. Shazam is a popular music recognization application, which uses one of the most well-known fingerprinting models. With three 3 principles: temporally localized, translation invariant, and robustness, The Shazam algorithm is treated as a good fingerprint algorithm. It shows strong robustness even in presence of noise by using local maximum in spectrogram to form hashes.

## 3.2. Interpreting the fingerprint extractor as a CNN

The generic neural network model consists two convolutional layers and a max-pooling layer, depicted in the figure below. As mentioned above, the convolutional neural network is well-known for its properties of temporarily localized and transformational invariant. The purpose of this network is to generate audio fingerprinting signals that extract features that uniquely identify a signal, regardless of the starting and ending time of the inputs.

While an audio sample enters the neural network, it is first transformed by the initial network layer, which can be described as a normalized Hann function. The form of the function is shown below, with N being the width of the Kernel.

$$f_{1}(n)=\frac {sin^2(\frac{\pi n} {N})} {\sum sin^2(\frac{\pi n}{N})}$$

The intention of the normalized Hann function is to smooth the adversarial perturbation of the input audio signal, which removes the discontinuity as well as the bad spectral properties. This transformation enhances the efficiency of black-box attacks that is later implemented.

The next convolutional layer applies a Short Term Fourier Transformation to the input signal by computing the spectrogram of the waveform and converts the input into a feature representation. Once the input signal enters this network layer, it is being transformed by the convolutional function below.

$$f_{2}(k,n)=e^{-i 2 \pi k n / N}$$ where k ${\in}$ 0,1,...,N-1 (output channel index) and n ${\in}$ 0,1,...,N-1 (index of filter coefficient)

The output of this layer is described as φ(x) (x being the input signal), a feature representation of the audio signal sample. However, this representation is flawed due to its vulnerability to noise and perturbation, as well as its difficulty to store and inspect. Therefore, a maximum pooling layer is being implemented to φ(x), in which the network computes a local maximum using a max-pooling function. This network layer outputs a binary fingerprint ψ (x) (x being the input signal) that will be used later to search for a signal against a database of previously processed signals.

## 3.3 Formulating the adversarial loss function

In the previous section, local maxima of spectrogram are used to generate fingerprints by CNN, but a loss has not been quantified how similar tow fingerprints are. After the loss is found, standard gradient methods can be used to find a perturbation ${\delta}$, which can be added to a signal so that the copyright detection system will be tricked. Also, a bound is set to make sure the generated fingerprints are close enough to the original audio signal. $$\text{bound:}\ ||\delta||_p\le\epsilon$$

where ${||\delta||_p\le\epsilon}$ is the ${l_p}$-norm of the perturbation and ${\epsilon}$ is the bound of the difference between the original file and the adversarial example.

To compare how similar two binary fingerprints are, Hamming distance is employed. Hamming distance between two strings is the number of digits that are different. For example, the Hamming distance between 101100 and 100110 is 2.

Let ${\psi(x)}$ and ${\psi(y)}$ be two binary fingerprints outputted from the model, the number of peaks shared by ${x}$ and ${y}$ can be found through ${|\psi(x)\cdot\psi(y)|}$. Now, to get a differentiable loss function, the equation is found to be

$$J(x,y)=|\phi(x)\cdot\psi(x)\cdot\psi(y)|$$

This is effective for white-box attacks with knowing the fingerprinting system. However, the loss can be easily minimized by modifying the location of the peaks by one pixel, which would not be reliable to transfer to black-box industrial systems. To make it more transferable, a new loss function which involves more movements of the local maxima of the spectrogram is proposed. The idea is to move the locations of peaks in ${\psi(x)}$ outside of neighborhood of the peaks of ${\psi(y)}$. In order to implement the model more efficiently, two max-pooling layers are used. One of the layers has a bigger width ${w_1}$ while the other one has a smaller width ${w_2}$. For any location, if the output of ${w_1}$ pooling is strictly greater than the output of ${w_2}$ pooling, then it can be concluded that no peak in that location with radius ${w_2}$.

The loss function is as the following:

$$J(x,y) = \sum_i\bigg(ReLU\bigg(c-\bigg(\underset{|j| \leq w_1}{\max}\phi(i+j;x)-\underset{|j| \leq w_2}{\max}\phi(i+j;x)\bigg)\bigg)\cdot\psi(i;y)\bigg)$$ The equation above penalizes the peaks of ${x}$ which are in neighborhood of peaks of ${y}$ with radius of ${w_2}$. The activation function uses ${ReLU}$. ${c}$ is the difference between the output of two max-pooling layers.

Lastly, instead of the maximum operator, smoothed max function is used here: $$S_\alpha(x_1,x_2,...,x_n) = \frac{\sum_{i=1}^{n}x_ie^{\alpha x_i}}{\sum_{i=1}^{n}e^{\alpha x_i}}$$ where ${\alpha}$ is a smoothing hyper parameter. When ${\alpha}$ approaches positive infinity, ${S_\alpha}$ is closer to the actual max function.

To summarize, the optimization problem can be formulated as the following:

$$\underset{\delta}{\min}J(x+\delta,x)\\ s.t.||\delta||_{\infty}\le\epsilon$$ where ${x}$ is the input signal, ${J}$ is the loss function with the smoothed max function.

## 4. Evaluating transfer attacks on industrial systems

The effectiveness of default and remix adversarial examples is tested through white-box attacks on the proposed model and black-box attacks on two real-world audio copyright detection systems - AudioTag and YouTube “Content ID” system. L_infinity norm and L2 norm of perturbations are two measures of modification. Both of them are calculated after normalizing the signals so that the samples could lie between 0 and 1.

Before evaluating black-box attacks against real-world systems, white-box attacks are used to provide the baseline of adversarial examples’ effectiveness. Loss function (3) is used to generate white-box attacks. The unnoticeable fingerprints of the audio with the noise can be changed or removed by optimizing the loss function.

Table 1: Norms of the perturbations for white-box attacks

In black-box attacks, the AudioTag system is found to be relatively sensitive to the attacks since it can detect the songs with a benign signal while it failed to detect both default and remix adversarial examples. The architecture of the AudioTag fingerprint model and surrogate CNN model is guessed to be similar based on the experimental observations.

Similar to AudioTag, the YouTube “Content ID” system also got the result with successful identification of benign songs but failure to detect adversarial examples. However, to fool the YouTube Content ID system, a larger value of the parameter is required. YouTube Content ID system has a more robust fingerprint model.

Table 2: Norms of the perturbations for black-box attacks

## 5. Conclusion

In conclusion, many industrial copyright detection systems used in the popular video and music website such as YouTube and AudioTag are significantly vulnerable to adversarial attacks established in the existing literature. By building a simple music identification system resembling that of Shazam using neural network and attack it by the well-known gradient method, this paper firmly proved the lack of robustness of the current online detector. The intention of this paper is to raise the awareness of the vulnerability of the current online system to adversarial attacks and to emphasize the significance of enhancing our copyright detection system. More approach, such as adversarial training needs to be developed and examined, in order to protect us against the threat of adversarial copyright attack.