Introduction

Over the past few years,

Methodology

The paper assumes a lot of familiarity with adversarial attack literature. The section below briefly explains some key concepts.

Background

Adversarial Images Mathematically

Given an image [math]\displaystyle{ x }[/math] and a classifier [math]\displaystyle{ f(x) }[/math], an adversarial image [math]\displaystyle{ x' }[/math] satisfies two properties:

[math]\displaystyle{ D(x,x') \lt \epsilon }[/math]
[math]\displaystyle{ c(x') \neq c^*(x) }[/math]

where [math]\displaystyle{ D }[/math] is some distance metric, [math]\displaystyle{ \epsilon }[/math] is a small constant, [math]\displaystyle{ c(x') }[/math] is the output class predicted by the model, and [math]\displaystyle{ c^*(x) }[/math] is the true class for input [math]\displaystyle{ x }[/math]. In words, the adversarial image is a small distance from the original image, but the classifier classifies it incorrectly.
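The two conditions can be checked directly in code. The following is a minimal sketch, assuming images are flat lists of pixel values and using a hypothetical toy classifier (`toy_predict`) that is not part of the paper; the distance defaults to the maximum per-pixel difference.

```python
def linf_distance(x, x_adv):
    """Largest absolute per-pixel difference between two images."""
    return max(abs(a - b) for a, b in zip(x, x_adv))


def is_adversarial(x, x_adv, true_class, predict, eps, distance=linf_distance):
    """True when x_adv is within eps of x and the predicted class is wrong."""
    return distance(x, x_adv) < eps and predict(x_adv) != true_class


def toy_predict(x):
    """Hypothetical classifier: class 1 if the mean pixel value exceeds 0.5."""
    return 1 if sum(x) / len(x) > 0.5 else 0


x = [0.52, 0.50, 0.51]      # mean 0.51, classified as class 1
x_adv = [0.49, 0.47, 0.48]  # every pixel lowered by 0.03, mean 0.48
print(is_adversarial(x, x_adv, true_class=1, predict=toy_predict, eps=0.05))
```

With a tighter budget such as `eps=0.01`, the same perturbed image no longer counts as adversarial because it violates the distance condition.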

Adversarial Attacks Terminology

Adversarial attacks can be either black-box or white-box. In black-box attacks, the attacker has access to the network output only, while white-box attackers have full access to the network, including its gradients, architecture and weights. This makes white-box attackers much more powerful. Given access to gradients, white-box attacks use back-propagation to modify the inputs (as opposed to the weights) with respect to the loss function.
In untargeted attacks, the objective is to maximize the loss of the true class, [math]\displaystyle{ x'=x \mathbf{+} \lambda(sign(\nabla_xL(x,c^*(x)))) }[/math]. In targeted attacks, the objective is to minimize the loss for a target class [math]\displaystyle{ c^t(x) }[/math] that is different from the true class, [math]\displaystyle{ x'=x \mathbf{-} \lambda(sign(\nabla_xL(x,c^t(x)))) }[/math]. Here, [math]\displaystyle{ \nabla_xL() }[/math] is the gradient of the loss function with respect to the input, [math]\displaystyle{ \lambda }[/math] is a small gradient step and [math]\displaystyle{ sign() }[/math] is the sign of the gradient.
An attacker may be allowed to use a single step of back-propagation (single-step attack) or multiple steps (iterative attack). Iterative attackers can generate more powerful adversarial images. Typically, a distance measure is used to bound iterative attackers.
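The single-step untargeted and targeted updates can be sketched on a very small model. The example below assumes a hypothetical two-input logistic regression classifier with cross-entropy loss, for which the input gradient has the closed form (p - y) w; the weights and inputs are illustrative only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def prob(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(w, b, x, y):
    """Cross-entropy loss L(x, y) for label y in {0, 1}."""
    p = prob(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def input_gradient(w, b, x, y):
    """Gradient of the loss with respect to the input x (not the weights)."""
    p = prob(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def untargeted_step(w, b, x, true_class, lam):
    """x' = x + lam * sign(grad_x L(x, true class)): increase the true-class loss."""
    g = input_gradient(w, b, x, true_class)
    return [xi + lam * sign(gi) for xi, gi in zip(x, g)]


def targeted_step(w, b, x, target_class, lam):
    """x' = x - lam * sign(grad_x L(x, target class)): decrease the target-class loss."""
    g = input_gradient(w, b, x, target_class)
    return [xi - lam * sign(gi) for xi, gi in zip(x, g)]


w, b = [2.0, -1.0], 0.0   # illustrative weights
x = [0.3, 0.2]            # classified as class 1
x_untargeted = untargeted_step(w, b, x, true_class=1, lam=0.1)
x_targeted = targeted_step(w, b, x, target_class=0, lam=0.1)
```

In this binary toy problem the two updates coincide, since pushing away from class 1 is the same as pushing toward class 0; with more than two classes they generally differ.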

In this paper, the authors focus on the more difficult attacks: white-box, iterative, targeted and untargeted attacks.

Obfuscated Gradients

As gradients are used in the generation of white-box adversarial images, many defense strategies have focused on methods that mask gradients. If gradients are masked, they cannot be followed to generate adversarial images. The authors argue against this general approach by showing that it can be easily circumvented. To emphasize their point, they looked at white-box defenses proposed in ICLR 2018. Three types of gradient masking techniques were found:

Shattered gradients: Non-differentiable operations are introduced into the model, causing a gradient to be nonexistent or incorrect. Introduced by using operations where following the gradient doesn't maximize classification loss globally.
Stochastic gradients: A stochastic process is added into the model at test time, causing the gradients to become randomized. Introduced by either randomly transforming inputs before feeding to the classifier, or randomly permuting the network itself.
Vanishing gradients: Very deep neural networks or those with recurrent connections are used. Because of the vanishing or exploding gradient problem common in these deep networks, effective gradients at the input are small and not very useful. Introduced by using multiple iterations of neural network evaluation, where the output of one network is fed as the input to the next.

The Attacks

To circumvent these gradient masking techniques, the authors propose:

Backward Pass Differentiable Approximation (BPDA): For defenses that introduce non-differentiable components, the authors replace each such component with an approximate function that is differentiable on the backward pass. In a white-box setting, the attacker has full access to any added non-linear transformation and can find its approximation.
Expectation over Transformation [Athalye, 2017]: For defenses that add some form of test time randomness, the authors propose to use expectation over transformation technique in the backward pass. Rather than moving along the gradient every step, several gradients are sampled and the step is taken in the average direction. This can help with any stochastic misdirection from individual gradients. The technique is similar to using mini-batch gradient descent but applied in the construction of adversarial images.
Re-parameterize the exploration space: For very deep networks that rely on vanishing or exploding gradients, the authors propose to re-parameterize and search over the range where the gradient does not explode/vanish.
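The averaging idea behind Expectation over Transformation can be illustrated with a toy randomized loss. This is a minimal sketch, assuming a hypothetical defense that adds Gaussian noise to the input before a quadratic loss; the loss, noise level and sample count are illustrative choices, not taken from the paper.

```python
import random


def sampled_gradient(x, rng, noise_std=1.0):
    """Input gradient of the toy loss sum((x_i + n_i - 1)^2) for one noise draw n."""
    return [2.0 * (xi + rng.gauss(0.0, noise_std) - 1.0) for xi in x]


def eot_gradient(x, rng, n_samples=4000):
    """Average many sampled gradients to estimate the gradient of the expected loss."""
    total = [0.0] * len(x)
    for _ in range(n_samples):
        g = sampled_gradient(x, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]


rng = random.Random(0)
x = [0.2, 0.9]
single = sampled_gradient(x, rng)              # one noisy, possibly misleading gradient
estimate = eot_gradient(x, rng)                # averaged gradient
expected = [2.0 * (xi - 1.0) for xi in x]      # gradient with the noise averaged out
```

A single sampled gradient can point in the wrong direction, while the average converges to the gradient of the expected loss as the number of samples grows.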

Main Results

The table above summarizes the results of their attacks. Attacks are mounted on the same dataset each defense targeted. If multiple datasets were used, attacks were performed on the largest one. Two different distance metrics ([math]\displaystyle{ \ell_{\infty} }[/math] and [math]\displaystyle{ \ell_{2} }[/math]) were used in the construction of adversarial images. Distance metrics specify how much an adversarial image can vary from an original image. For [math]\displaystyle{ \ell_{\infty} }[/math] adversarial images, each pixel is allowed to vary by a maximum amount. For example, [math]\displaystyle{ \ell_{\infty}=0.031 }[/math] specifies that each pixel can vary by [math]\displaystyle{ 256*0.031 \approx 8 }[/math] intensity levels from its original value. [math]\displaystyle{ \ell_{2} }[/math] distances specify the magnitude of the total distortion allowed over all pixels. For MNIST and CIFAR-10, untargeted adversarial images were constructed using the entire test set, while for ImageNet, 1000 test images were randomly selected and used to generate targeted adversarial images.
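The two distance metrics and the per-pixel budget can be written out as follows. This is a small sketch, assuming images are flat lists of values in [0, 1] and using the plain Euclidean norm for the second metric; individual papers may normalize it differently. The helper `project_linf` is a hypothetical name for clipping a perturbed image back into the allowed range.

```python
import math


def linf_distance(x, x_adv):
    """Maximum change applied to any single pixel."""
    return max(abs(a - b) for a, b in zip(x, x_adv))


def l2_distance(x, x_adv):
    """Euclidean size of the total distortion over all pixels."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_adv)))


def project_linf(x, x_adv, eps):
    """Clip x_adv so that every pixel stays within eps of the original x."""
    return [min(max(a, xi - eps), xi + eps) for xi, a in zip(x, x_adv)]


eps = 0.031
pixel_budget = 256 * eps   # about 7.9, i.e. roughly 8 intensity levels out of 256

x = [0.5, 0.5]
x_far = [0.6, 0.2]
x_clipped = project_linf(x, x_far, eps)   # roughly [0.531, 0.469]
```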

Standard models were used in evaluating the accuracy of defense strategies under the attacks:

MNIST: 5-layer Convolutional Neural Network (99.3% top-1 accuracy)
CIFAR-10: Wide-Resnet (95.0% top-1 accuracy)
ImageNet: InceptionV3 (78.0% top-1 accuracy)

The last column shows the accuracy each defense method achieved on the adversarial test set. Except for [Madry, 2018], all defense methods achieved an accuracy below 10%, and most were reduced to 0%. The results for [Samangouei, 2018] (double asterisk) show that the attack on that defense was not as successful. The authors claim that this is a result of implementation imperfections, but that theoretically the defense can be circumvented using their proposed method.

The defense that worked - Adversarial Training [Madry, 2018]

As a defense mechanism, [Madry, 2018] proposes training the neural networks with adversarial images. Although this approach was previously known [Szegedy, 2013], in their formulation the problem is set up in a more systematic way using a min-max formulation: \begin{align} \theta^* = \arg\min_{\theta} \, \mathbb{E}_{x} \bigg[ \max_{\delta \in [-\epsilon,\epsilon]} L(x+\delta,y;\theta) \bigg] \end{align}

where [math]\displaystyle{ \theta }[/math] denotes the parameters of the model, [math]\displaystyle{ \theta^* }[/math] is the optimal set of parameters, [math]\displaystyle{ y }[/math] is the label of input [math]\displaystyle{ x }[/math], and [math]\displaystyle{ \delta }[/math] is a small perturbation to the input image [math]\displaystyle{ x }[/math] that is bounded by [math]\displaystyle{ [-\epsilon,\epsilon] }[/math].

Training proceeds in the following way. For each clean input image, a distorted version of the image is found by approximately solving the inner maximization problem for a fixed number of iterations. After each gradient step, the perturbed image is projected back into the allowed range (projected gradient descent). Next, the classification problem is solved by taking a step on the outer minimization problem using the distorted images.
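The training procedure can be sketched end to end on a toy problem. The example below is a minimal illustration, assuming a hypothetical two-feature logistic regression model and a hand-made dataset rather than the networks used by [Madry, 2018]; the step size, learning rate and epoch count are arbitrary choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_prob(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def pgd_attack(w, b, x, y, eps, step, n_steps):
    """Inner maximization: ascend the loss, projecting back into the eps-box around x."""
    x_adv = list(x)
    for _ in range(n_steps):
        p = predict_prob(w, b, x_adv)
        grad = [(p - y) * wi for wi in w]   # input gradient of the cross-entropy loss
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


def adversarial_train(data, eps, lr=0.5, epochs=500):
    """Outer minimization: gradient steps on the loss at the adversarial inputs."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = pgd_attack(w, b, x, y, eps, step=eps / 4, n_steps=10)
            p = predict_prob(w, b, x_adv)
            w = [wi - lr * (p - y) * xa for wi, xa in zip(w, x_adv)]
            b -= lr * (p - y)
    return w, b


# Hand-made toy dataset: class 1 near (0.8, 0.8), class 0 near (0.2, 0.2).
data = [([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.7, 0.8], 1),
        ([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.3, 0.2], 0)]
w, b = adversarial_train(data, eps=0.1)
```

Because the model is trained on the worst-case points found by the inner loop, the learned decision boundary leaves a margin of at least the perturbation budget around each training point in this separable toy case.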

This approach was shown to provide resilience to all types of adversarial attacks.

How to check for Obfuscated Gradients

For future defense proposals, it is recommended to avoid relying on masked gradients. To assist with this, the authors propose a set of warning signs that can help identify whether a defense is relying on masked gradients:

Weaker one-step attacks perform better than iterative attacks.
Black-box attacks find stronger adversarial images than white-box attacks.
Unbounded iterative attacks do not reach 100% success.
Random brute-force search is better than gradient-based methods at finding adversarial images.
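These checks are simple comparisons between attack success rates, so they can be collected in a small helper. The function and argument names below are hypothetical; the sketch assumes each argument is the measured success rate (between 0 and 1) of one attack type against the same defense.

```python
def obfuscation_warning_signs(one_step, iterative, black_box, white_box,
                              unbounded_iterative, random_search):
    """Return the warning signs triggered by a set of attack success rates."""
    signs = []
    if one_step > iterative:
        signs.append("one-step attack beats iterative attack")
    if black_box > white_box:
        signs.append("black-box attack beats white-box attack")
    if unbounded_iterative < 1.0:
        signs.append("unbounded attack does not reach 100% success")
    if random_search > iterative:
        signs.append("random search beats gradient-based attack")
    return signs


# Illustrative numbers only, not results from the paper.
suspicious = obfuscation_warning_signs(
    one_step=0.40, iterative=0.25, black_box=0.50, white_box=0.25,
    unbounded_iterative=0.80, random_search=0.30)
healthy = obfuscation_warning_signs(
    one_step=0.30, iterative=0.90, black_box=0.40, white_box=0.90,
    unbounded_iterative=1.0, random_search=0.05)
```

An empty list does not prove that a defense is robust; the checks only flag behavior that is typical of masked gradients.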

Recommendations for future defense methods to encourage reproducibility

Detailed Results

Non-obfuscated Gradients

Cascade Adversarial Training, [Na, 2018]

Defense: Similar to the method of [Madry, 2018], the authors of [Na, 2018] propose a new training method. The main difference is that instead of using iterative methods to generate adversarial examples at each mini-batch, a separate model is first trained and used to generate adversarial images. These adversarial images are used to augment the training set of another model.

Attack: The authors found that this technique does not use obfuscated gradients, and they were not able to reduce the performance of this method. However, they point out that the claimed accuracy is much lower (15%) compared with [Madry, 2018] under the same perturbation setting.

Gradient Shattering

Thermometer Coding, [Buckman, 2018]

Defense: Inspired by the observation that neural networks learn linear boundaries between classes [Goodfellow, 2014], [Buckman, 2018] sought to break this linearity by explicitly adding a highly non-linear transform at the input of their model. The non-linear transformation they chose was quantizing inputs to binary vectors. The quantization performed was termed thermometer encoding.

Given an image, for each pixel value [math]\displaystyle{ x_{i,j,c} }[/math] and an [math]\displaystyle{ l }[/math]-dimensional thermometer code, the [math]\displaystyle{ k }[/math]th bit is given by: \begin{align} \tau(x_{i,j,c})_k = \begin{cases} 1 & \text{if } x_{i,j,c} > k/l \\ 0 & \text{otherwise} \end{cases} \end{align} Here it is assumed that [math]\displaystyle{ x_{i,j,c} \in [0, 1] }[/math] and [math]\displaystyle{ i, j, c }[/math] are the row, column and channel index of the pixel respectively. This encoding is similar to one-hot encoding, except that all the bits whose threshold [math]\displaystyle{ k/l }[/math] lies below the pixel value (not just one bit) are set to 1. This quantization technique preserves pairwise ordering between pixels.
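The encoding of a single pixel can be written in a few lines. This is a minimal sketch that assumes the bit index k runs from 1 to l; the indexing convention in the original defense may differ.

```python
def thermometer_encode(x, levels):
    """Bit k (for k = 1..levels) is 1 when the pixel value x exceeds k / levels."""
    return [1 if x > k / levels else 0 for k in range(1, levels + 1)]


low = thermometer_encode(0.13, 10)    # only the first threshold (0.1) is exceeded
high = thermometer_encode(0.66, 10)   # thresholds 0.1 through 0.6 are exceeded
print(low)
print(high)
```

The number of set bits never decreases as the pixel value grows, which is how the code preserves the ordering between pixel values while discarding small differences inside a quantization bin.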

On CIFAR-10, the model gave 50% accuracy against [math]\displaystyle{ \ell_\infty }[/math] adversarial images with [math]\displaystyle{ \epsilon=0.031 }[/math].

Attack: The authors attack this model using their BPDA approach. Given the non-linear transformation performed in the forward pass, [math]\displaystyle{ \tau(x) }[/math], they develop a differentiable counterpart, \begin{align} \hat{\tau}(x_{i,j,c})_k = \min \left( \max \left( x_{i,j,c} - \frac{k}{l}, 0 \right), 1 \right) \end{align} and use it in place of [math]\displaystyle{ \tau(x) }[/math] on the backward pass. With their modifications they were able to bring the accuracy of the model down to 0%.
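The effect of swapping in a differentiable surrogate on the backward pass can be seen on a single pixel. The sketch below assumes bit indices k = 1..l and a hypothetical downstream loss that is a weighted sum of the code bits; it compares the finite-difference gradient of the true encoding (which is zero almost everywhere) with the gradient obtained through the clipped-linear surrogate.

```python
def thermometer(x, levels):
    """Non-differentiable forward transform: bit k is 1 when x > k / levels."""
    return [1.0 if x > k / levels else 0.0 for k in range(1, levels + 1)]


def surrogate_grad(x, levels):
    """Derivative of min(max(x - k/levels, 0), 1) with respect to x, per bit."""
    return [1.0 if 0.0 < x - k / levels < 1.0 else 0.0
            for k in range(1, levels + 1)]


def toy_loss(code, weights):
    """Hypothetical downstream loss: weighted sum of the code bits."""
    return sum(w * c for w, c in zip(weights, code))


def true_gradient(x, levels, weights, h=1e-4):
    """Finite-difference gradient through the real (step-like) encoding."""
    up = toy_loss(thermometer(x + h, levels), weights)
    down = toy_loss(thermometer(x - h, levels), weights)
    return (up - down) / (2 * h)


def bpda_gradient(x, levels, weights):
    """Forward pass uses thermometer(x); backward pass uses the surrogate derivative."""
    upstream = weights   # d(toy_loss)/d(code_k) for the weighted-sum loss
    return sum(g * d for g, d in zip(upstream, surrogate_grad(x, levels)))


weights = [1.0] * 10
shattered = true_gradient(0.66, 10, weights)   # 0.0: no signal for the attacker
approximate = bpda_gradient(0.66, 10, weights) # non-zero: usable attack direction
```

The surrogate gradient is not exact, but it points in a direction that changes the encoded input, which is all an iterative attack needs.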

Input Transformation, [Guo, 2018]

Defense: [Guo, 2018] investigated the effect of including different input transformations on the robustness to adversarial images. In particular, they found two techniques provided the greatest resistance: total variance minimization and image quilting. Total variance minimization is a technique that removes high frequency noise while preserving legitimate edges (good high frequency components). In image quilting, a large database of image patches from clean images is collected. At test time, input patches that contain a lot of noise are replaced with similar but clean patches from the database.

Both techniques remove perturbations from adversarial images, which provides some robustness to adversarial attacks. Both approaches are also non-differentiable, which makes constructing white-box adversarial images difficult. In addition, the techniques include test-time randomness, as the modifications made are input dependent. The best model achieved 60% accuracy on adversarial images with [math]\displaystyle{ \ell_{2}=0.05 }[/math] perturbations.

Attack: The authors used the BPDA attack where the input transformations were replaced by an identity function. They were able to bring the accuracy of the model down to 0% under the same type of adversarial attacks.

Local Intrinsic Dimensionality, [Ma, 2018]

Defense: Local intrinsic dimensionality (LID) is a distance-based metric that measures the similarity between points in a high dimensional space. Given a set of points, let the distance between sample [math]\displaystyle{ x }[/math] and its [math]\displaystyle{ i }[/math]th nearest neighbor be [math]\displaystyle{ r_i(x) }[/math]. The LID under the chosen distance metric is then given by

\begin{align} LID(x) = - \bigg( \frac{1}{k}\sum^k_{i=1} \log \frac{r_i(x)}{r_k(x)} \bigg)^{-1} \end{align} where [math]\displaystyle{ k }[/math] is the number of nearest neighbors considered and [math]\displaystyle{ r_k(x) }[/math] is the maximum distance to any of the [math]\displaystyle{ k }[/math] neighbors.
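The estimator can be computed directly from sorted neighbor distances. This is a minimal sketch, assuming Euclidean distance, a reference set that does not contain the query point itself (a zero distance would make the logarithm undefined), and neighbor distances that are not all equal.

```python
import math


def lid_estimate(x, reference, k):
    """LID of x from its k nearest neighbors in `reference` (Euclidean distance)."""
    dists = sorted(math.dist(x, r) for r in reference)
    nearest = dists[:k]
    r_k = nearest[-1]   # distance to the farthest of the k neighbors
    mean_log = sum(math.log(r / r_k) for r in nearest) / k
    return -1.0 / mean_log


# One-dimensional toy points at distances 1, 2, 4 and 10 from the query.
reference = [[1.0], [2.0], [4.0], [10.0]]
score = lid_estimate([0.0], reference, k=3)
print(score)
```

For the three nearest distances 1, 2 and 4, the mean log-ratio is -ln 2, so the estimate equals 1 / ln 2.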

First, [math]\displaystyle{ L_2 }[/math] distances were computed for all training and adversarial images. Next, the LID scores for each training and adversarial image were calculated. It was found that LID scores for adversarial images were significantly larger than those of clean images. Based on these results, a separate classifier was created that can be used to detect adversarial inputs. [Ma, 2018] claim that this is not a defense method, but a method to study the properties of adversarial images.

Attack: Instead of attacking this method, the authors show that it is not able to detect, and is therefore vulnerable to, attacks of the [Carlini and Wagner, 2017a] variety.

Stochastic Gradients

Stochastic Activation Pruning, [Dhillon, 2018]

Defense: [Dhillon, 2018] use test-time randomness in their model to guard against adversarial attacks. Within a layer, the activations of component nodes are randomly dropped with a probability proportional to their absolute value. The remaining activations are scaled up to preserve accuracy. This is akin to test-time drop-out. This technique was found to drop accuracy slightly on clean images, but improved performance on adversarial images.

Attack: The authors used the expectation over transformation attack to get useful gradients out of the model. With their attack they were able to reduce the accuracy of this method down to 0% on CIFAR-10.

Mitigation Through Randomization, [Xie, 2018]

Defense: [Xie, 2018] add a randomization layer to their model to help defend against adversarial attacks. For an input image of size [299,299], first the image is randomly re-scaled to size [r,r] with [math]\displaystyle{ r \in [299,331] }[/math]. Next, the image is zero-padded to fix the dimension of the modified input. This modified input is then fed into a regular classifier. The authors claim that this strategy can provide an accuracy of 32.8% against ensemble attack patterns (fixed distortions, but many of them, which are picked randomly). Because of the introduced randomness, the authors claim the model builds some robustness to other types of attacks as well.
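The randomization layer can be sketched on a small square image stored as nested lists. The example assumes nearest-neighbor resizing and a randomly chosen padding offset, both of which are illustrative assumptions rather than details confirmed by this summary; the function names are hypothetical.

```python
import random


def resize_nearest(image, new_size):
    """Nearest-neighbor resize of a square image (list of rows) to new_size x new_size."""
    old = len(image)
    return [[image[r * old // new_size][c * old // new_size]
             for c in range(new_size)]
            for r in range(new_size)]


def randomization_layer(image, max_size, rng):
    """Randomly rescale the image, then zero-pad it to a fixed max_size x max_size."""
    size = rng.randint(len(image), max_size)   # random side length, bounds inclusive
    resized = resize_nearest(image, size)
    pad = max_size - size
    top = rng.randint(0, pad)                  # assumed: random placement of the image
    left = rng.randint(0, pad)
    out = [[0.0] * max_size for _ in range(max_size)]
    for r in range(size):
        for c in range(size):
            out[top + r][left + c] = resized[r][c]
    return out


rng = random.Random(0)
image = [[1.0] * 4 for _ in range(4)]          # stand-in for a 299 x 299 input
padded = randomization_layer(image, max_size=6, rng=rng)   # stand-in for 331 x 331
```

Each call produces a different layout of the same image, so a gradient computed for one draw does not match the transformation applied on the next forward pass.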

Attack: The EOT method was used to build adversarial images to attack this model. With their attack, the authors were able to bring the accuracy of this model down to 0% using [math]\displaystyle{ L_{\infty}(\epsilon=0.031) }[/math] perturbations.

Vanishing and Exploding Gradients

PixelDefend, [Song, 2018]

Defense:

The reason for choosing this model is the long iterative process of generation. In the backward pass, following the gradient all the way to the input would not be possible because of the vanishing/exploding gradient problem of deep networks. The proposed model was able to obtain an accuracy of 46% on CIFAR-10 images with [math]\displaystyle{ l_{\infty} (\epsilon=0.031) }[/math] perturbations.

Attack: The model was attacked using the BPDA technique, where back-propagating through the PixelCNN was replaced with an identity function. With this approach, the authors were able to bring down the accuracy to 9% under the same kind of perturbations.

Defense-GAN, [Samangouei, 2018]

Conclusion

In this paper,

Critique

The third attack method,

Other Sources

Their re-implementation of each of the defenses and implementations of the attacks are available here.

References

[Madry, 2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A., 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
