Obfuscated Gradients Give a False Sense of Security Circumventing Defenses to Adversarial Examples

From statwiki
Revision as of 16:49, 6 December 2018 by S362khan (talk | contribs) (The Attacks: Cleanup)
Jump to: navigation, search


Over the past few years, neural network models have been the source of major breakthroughs in a variety of computer vision problems. However, these networks have been shown to be susceptible to adversarial attacks. In these attacks, small humanly-imperceptible changes are made to images (that are originally correctly classified) which causes these models to misclassify with high confidence. These attacks pose a major threat that needs to be addressed before these systems can be deployed on a large scale, especially in safety-critical scenarios.

The seriousness of this threat has generated major interest in both the design and defense against them. Recently, many new defenses have been proposed that claim robustness against iterative white-box adversarial attacks. This result is somewhat surprising, given that iterative white-box attacks are one of the most difficult classes of adversarial attacks. In this paper, the authors identify a common flaw, masked gradients, in many of these defenses that cause them to perceive a high accuracy on adversarial images. This flaw is so prevalent, that 7 out of the 9 defenses proposed in the ICLR 2018 conference were found to contain them. The authors develop three attacks, specifically targeting masked gradients, and show that the actual accuracy of these defenses is much lower than claimed. In fact, the majority of these attacks were found to be ineffective against true iterative white box attacks.


The paper assumes a lot of familiarity with adversarial attack literature. The section below briefly explains some key concepts.


Adversarial Images Mathematically

Given an image [math]x[/math] and a classifier [math]f(x)[/math], an adversarial image [math]x'[/math] satisfies two properties:

  1. [math]D(x,x') \lt \epsilon [/math]
  2. [math]c(x') \neq c^*(x) [/math]

Where [math]D[/math] is some distance metric, [math]\epsilon [/math] is a small constant, [math]c(x')[/math] is the output class predicted by the model, and [math]c^*(x)[/math] is the true class for input x. In words, the adversarial image is a small distance from the original image, but the classifier classifies it incorrectly.

Adversarial Attacks Terminology

  1. Adversarial attacks can be either black or white-box. In black box attacks, the attacker has access to the network output only, while white-box attackers have full access to the network, including its gradients, architecture and weights. This makes white-box attackers much more powerful. Given access to gradients, white-box attacks use back propagation to modify inputs (as opposed to the weights) with respect to the loss function.
  2. In untargeted attacks, the objective is to maximize the loss of the true class, [math]x'=x \mathbf{+} \lambda(sign(\nabla_xL(x,c^*(x))))[/math]. While in targeted attacks, the objective is to minimize loss for a target class [math]c^t(x)[/math] that is different from the true class, [math]x'=x \mathbf{-} \epsilon(sign(\nabla_xL(x,c^t(x))))[/math]. Here, [math]\nabla_xL()[/math] is the gradient of the loss function with respect to the input, [math]\lambda[/math] is a small gradient step and [math]sign()[/math] is the sign of the gradient.
  3. An attacker may be allowed to use a single step of back-propagation (single step) or multiple (iterative) steps. Iterative attackers can generate more powerful adversarial images. Typically, to bound iterative attackers a distance measure is used.

In this paper the authors focus on the more difficult attacks; white-box iterative targeted and untargeted attacks.

Obfuscated Gradients

If gradients are masked, they cannot be followed to generate adversarial images. Gradient masking is known to be an incomplete defense to adversarial images[Papernot et al., 2017; Tramer et al., 2018]. A defense method may appear to be providing robustness, but in reality, the gradients in the network cannot be followed to generate strong adversarial images. Generated adversarial images from these networks are much weaker and when used to evaluate the model robustness give a false sense of security against adversarial attacks.

Some defences break gradient descent deliberately, others may do it unintentionally. Some indicators of a broken gradient descent are as follows:

  1. One-step attacks perform better than iterative attacks, which are strictly stronger, so this shouldn’t be the case. If single-step methods are working better, it’s a sign the iterative attack is becoming stuck at a local minimum.
  2. Black-box attacks work better than white-box attacks. The black-box threat model is a strict subset of white-box attacks, so white-box attacks should perform better. When a defense obfuscates gradients, then black-box attacks (which don’t use it) often perform better.
  3. Unbounded attacks do not reach 100% success. With unbounded distortion, any classifier should eventually fail. An attack that doesn’t achieve this should be improved (i.e., it’s a weak attack, not necessarily a strong defense).
  4. Random sampling finds adversarial examples. Random sampling (e.g., randomly sampling [math]10^5[/math] or more points within some [math]\epsilon[/math]-ball) should not find adversarial examples when gradient-based attacks do not.
  5. Increasing the distortion bound does not increase success. Usually, a monotonically increasing attack success rate with increasing distortion bound is expected.

In the defenses proposed in ICLR 2018, there are three ways of obfuscating gradients were found:

  1. Shattered gradients: Non-differentiable operations are introduced into the model, causing a gradient to be nonexistent or incorrect. Introduced by using operations where following the gradient doesn't maximize classification loss globally.
  2. Stochastic gradients: A stochastic process is added into the model at test time, causing the gradients to become randomized. Introduced by either randomly transforming inputs before feeding to the classifier, or randomly permuting the network itself.
  3. Vanishing Gradients : Very deep neural networks or those with recurrent connections are used. Because of the vanishing or exploding gradient problem common in these deep networks, effective gradients at the input are small and not very useful. Introduced by using multiple iterations of neural network evaluation, where the output of one network is fed as the input to the next.

Detecting Obfuscated Gradients:

The authors propose a number of tests that might help detect when a defence relies on obfuscated gradients.

Iterative attacks should work better than single-step attacks, since iterative attacks are strictly stronger than single-step attacks. White-box attacks should perform better than black-box attacks, since the black-box threat model is a strict subset of the white-box threat model. Attacks with an unbounded distortion metric (e.g. L_2 norm) should find adversarial examples with 100% success. Optimization-based attacks should perform better than brute-force sampling of nearby inputs (sampling within an ϵ-ball). These tests may not cover all cases of obfuscated gradients, but they indicate when intuitive properties start to break down. All defences with obfuscated gradients discussed by the authors fail at least one test.

The Attacks

To circumvent these gradient masking techniques, the authors propose:

  1. Backward Pass Differentiable Approximation (BPDA): For defences that introduce non-differentiable components, the authors replace it with an approximate function that is differentiable on the backward pass. In a white-box setting, the attacker has full access to any added non-linear transformation and can find its approximation.
  2. Expectation over Transformation [Athalye, 2017]: For defences that add some form of test time randomness, the authors propose to use expectation over transformation technique in the backward pass. Rather than moving along the gradient every step, several gradients are sampled and the step is taken in the average direction. This can help with any stochastic misdirection from individual gradients. The technique is similar to using mini-batch gradient descent but applied in the construction of adversarial images.
  3. Re-parameterize the exploration space: For very deep networks that rely on vanishing or exploding gradients, the authors propose to re-parameterize and search over the range where the gradient does not explode/vanish.

Main Results

Summary Table.png

The table above summarizes the results of their attacks. Attacks are mounted on the same dataset each defence targeted. If multiple datasets were used, attacks were performed on the largest one. Two different distance metrics ([math]\ell_{\infty}[/math] and [math]\ell_{2}[/math]) were used in the construction of adversarial images. Distance metrics specify how much an adversarial image can vary from an original image. For [math]\ell_{\infty}[/math] adversarial images, each pixel is allowed to vary by a maximum amount. For example, [math]\ell_{\infty}=0.031[/math] specifies that each pixel can vary by [math]256*0.031=8[/math] from its original value. [math]\ell_{2}[/math] distances specify the magnitude of the total distortion allowed over all pixels. For MNIST and CIFAR-10, untargeted adversarial images were constructed using the entire test set, while for Imagenet, 1000 test images were randomly selected and used to generate targeted adversarial images.

Standard models were used in evaluating the accuracy of defense strategies under the attacks,

  1. MNIST: 5-layer Convolutional Neural Network (99.3% top-1 accuracy)
  2. CIFAR-10: Wide-Resnet (95.0% top-1 accuracy)
  3. Imagenet: InceptionV3 (78.0% top-1 accuracy)

The last column shows the accuracies each defence method achieved over the adversarial test set. Except for [Madry, 2018], all defence methods could only achieve an accuracy of <10%. Furthermore, the accuracy of most methods was 0%. The results of [Samangoui,2018] (double asterisk), show that their approach was not as successful. The authors claim that is is a result of implementation imperfections but theoretically, the defense can be circumvented using their proposed method.

The defense that worked - Adversarial Training [Madry, 2018]

As a defense mechanism, [Madry, 2018] proposes training the neural networks with adversarial images. Although this approach is previously known [Szegedy, 2013] in their formulation, the problem is setup in a more systematic way using a min-max formulation: \begin{align} \theta^* = \arg \underset{\theta} \min \mathop{\mathbb{E_x}} \bigg{[} \underset{\delta \in [-\epsilon,\epsilon]}\max L(x+\delta,y;\theta)\bigg{]} \end{align}

where [math]\theta[/math] is the parameter of the model, [math]\theta^*[/math] is the optimal set of parameters and [math]\delta[/math] is a small perturbation to the input image [math]x[/math] and is bounded by [math][-\epsilon,\epsilon][/math].

Training proceeds in the following way. For each clean input image, a distorted version of the image is found by maximizing the inner maximization problem for a fixed number of iterations. Gradient steps are constrained to fall within the allowed range (projected gradient descent). Next, the classification problem is solved by minimizing the outer minimization problem.

This approach was shown to provide resilience to all types of adversarial attacks.

How to check for Obfuscated Gradients

For future defence proposals, it is recommended to avoid using masked gradients. To assist with this, the authors propose a set of conditions that can help identify if a defence is relying on masked gradients:

  1. If weaker one-step attacks are performing better than iterative attacks.
  2. Black-box attacks can find stronger adversarial images compared with white-box attacks.
  3. Unbounded iterative attacks do not reach 100% success.
  4. If random brute force attempts are better than gradient-based methods at finding adversarial images.

Detailed Results

As a case study for evaluating the prevalence of obfuscated gradients, the authors studied the ICLR 2018 non-certified defenses that argue robustness in a white-box threat model. Each of these defenses argues a high robustness to adaptive, white box attacks. It is reported that seven of these nine defenses depend on this phenomenon, and the authors demonstrate that their techniques can completely circumvent six of those (and partially circumvent one) that depend on obfuscated gradients.

Non-obfuscated Gradients

Cascade Adversarial Training, [Na, 2018]

Defense: Similar to the method of [Madry, 2018], the authors of [Na, 2018] propose adversarial training. The main difference is that instead of using iterative methods to generate adversarial examples at each mini-batch, a separate model is first trained and used to generate adversarial images. These adversarial images are used to augment the train set of another model.

Attack: The authors found that this technique does not use obfuscated gradients. They were not able to reduce the performance of this method. However, they point out that the claimed accuracy is much lower (%15) compared with [Madry, 2018] under the same perturbation setting.

Gradient Shattering

Thermometer Coding, [Buckman, 2018]

Defense: Inspired by the observation that neural networks learn linear boundaries between classes [Goodfellow, 2014] , [Buckman, 2018] sought to break this linearity by explicitly adding a highly non-linear transform at the input of their model. The non-linear transformation they chose was quantizing inputs to binary vectors. The quantization performed was termed thermometer encoding,

Given an image, for each pixel value [math]x_{i,j,c}[/math], if an [math]l[/math] dimensional thermometer code, the [math]kth[/math] bit is given by: \begin{align} \tau(x_{i,j,c})_k = \bigg{\{}\begin{array}{ll} 1 \space if \thinspace x_{i,j,c} > \dfrac{k}{l} \\ 0 \space otherwise \\ \end{array} \end{align} Here it is assumed [math]x_{i,j,c} \in [0, 1] [/math] and [math]i, j, c[/math] are the row, column and channel index of the pixel respectively. This encoding is like one-hot encoding, except all the points (not just one) greater than the target value are set to 1. This quantization technique preserves pairwise ordering between pixels.

On CIFAR-10, the model gave 50% accuracy against [math]\ell_\infty[/math] adversarial images with [math]\epsilon=0.031[/math] attacks.

Attack: The authors attack this model using there BPDA approach. Given the non-linear transformation performed in the forward pass, [math]\tau(x)[/math], they develop a differentiable counterpart, \begin{align} \hat{\tau}(x_{i,j,c})_k = \min ( \max (x_{i,j,c} - \frac{k}{l}), 1 ) \end{align} and use it in place of [math]\tau(x)[/math] on the backward pass. With their modifications they were able to bring the accuracy of the model down to 0%.

Input Transformation, [Guo, 2018]

Defence:[Gou, 2018] investigated the effect of including different input transformation on the robustness to adversarial images. As a baseline, the authors evaluate image cropping and rescaling, bit-depth reduction, and JPEG compression. In particular, they found two techniques provided the greatest resistance: total variance minimization and image quilting. Total variance minimization is a technique that removes high-frequency noise while preserving legitimate edges (good high-frequency components). In image quilting, a large database of image patches from clean images is collected. At test time, input patches, that contain a lot of noise, are replaced with similar but clean patches from the database. The authors explore different combinations of input transformations along with different underlying ImageNet classifiers, including adversarially trained models. They find that input transformations provide protection even with a vanilla classifier.

Both techniques, removed perturbations from adversarial images which provide some robustness to adversarial attacks. The best model achieved 60% accuracy on adversarial images with [math]l_{2}=0.05[/math] perturbations. However, both approaches are non-differentiable and contain test time randomness as the modifications made are input dependent. Gradient flow to the input is non-differentiable and random.

Attack: The authors used the BPDA attack where the input transformations were replaced by an identity function. They were able to bring the accuracy of the model down to 0% under the same type of adversarial attacks.

Local Intrinsic Dimensionality, [Ma, 2018]

Defense Local intrinsic dimensionality (LID) is a distance-based metric that measures the similarity between points in a high dimensional space. Given a set of points, let the distance between sample [math]x[/math] and its [math]ith[/math] neighbor be [math]r_i(x)[/math], then the LID under the choose distance metric is given by,

\begin{align} LID(x) = - \bigg{(} \frac{1}{k}\sum^k_{i=1}log \frac{r_i(x)}{r_k(x)} \bigg{)}^{-1} \end{align} where k is the number of nearest neighbors considered, [math]r_k(x)[/math] is the maximum distance to any of the neighbors in the set k.

First, [math]L_2[/math] distances for all training and adversarial images. Next, the LID scores for each train and adversarial images were calculated. It was found that LID scores for adversarial images were significantly larger than those of clean images. Base on these results, the a separate classifier was created that can be used to detect adversarial inputs. [Ma, 2018] claim that this is not a defence method, but a method to study the properties of adversarial images.

Attack: Instead of attacking this method, the authors show that this method is not able to detect, and is therefore venerable to, attacks of the [Carlini and Wagner, 2017a] variety.

Stochastic Gradients

Stochastic Activation Pruning, [Dhillon, 2018]

Defence: [Dhillon, 2018] use test time randomness in their model to guard against adversarial attacks. Because adversarial perturbations are like noises, randomly dropping activation can decrease their collective impact on the classifier. Within a layer, the activities of component nodes are randomly dropped with a probability proportional to its absolute value. The rest of the activation are scaled up to preserve accuracies. This is akin to test time drop-out. This technique was found to drop accuracy slightly on clean images, but improved performance on adversarial images.

Attack: The authors used the expectation over transformation attack to get useful gradients out of the model. With their attack, they were able to reduce the accuracy of this method down to 0% on CIFAR-10.

Mitigation Through Randomization, [Xie, 2018]

Defence: [Xie, 2018] Add a randomization layer to their model to help defend against adversarial attacks. For an input image of size [299,299], first the image is randomly re-scaled to [math]r \in [299,331][/math]. Next, the image is zero-padded to fix the dimension of the modified input. This modified input is then fed into a regular classifier. The authors claim that is strategy can provide an accuracy of 32.8% against ensemble attack patterns (fixed distortions, but many of them which are picked randomly). Because of the introduced randomness, the authors claim the model builds some robustness to other types of attacks as well.

Attack: The EOT method was used to build adversarial images to attack this model. With their attack, the authors were able to bring the accuracy of this model down to 0% using [math]L_{\infty}(\epsilon=0.031)[/math] perturbations.

Vanishing and Exploding Gradients

Pixel Defend, [Song, 2018]

Defence: [Song, 2018] argues that adversarial images lie in low probability regions of the data manifold. Therefore, one way to handle adversarial attacks is to project them back into the high probability regions before feeding them into a classifier. They chose to do this by using a generative model (pixelCNN) in a denoising capacity. A PixelCNN model directly estimates the conditional probability of generating an image pixel by pixel [Van den Oord, 2016],

\begin{align} p(\mathbf{x}= \prod_{i=1}^{n^2} p(x_i|x_0,x_1 ....x_{i-1})) \end{align}

The reason for choosing this model is the long iterative process of generation. In the backward pass, following the gradient, all the way to the input would not be possible because of the vanishing/exploding gradient problem of deep networks. The proposed model was able to obtain an accuracy of 46% on CIFAR-10 images with [math]l_{\infty} (\epsilon=0.031) [/math] perturbations.

Attack: The model was attacked using the BPDA technique where back-propagating though the pixelCNN was replaced with an identity function. With this approach, the authors were able to bring down the accuracy to 9% under the same kind of perturbations.

Defence-GAN, [Samangouei, 2018]

Before classifying the samples, Defence-GAN projects them onto the data manifold utilizing GAN. The intuition behind this approach is almost similar to that of PixelDefend. It uses GAN instead of pixel CNN.

The authors used MNIST because CIFAR-10 is not argued secure. They found adversarial examples exist in the generator manifold, and they can construct an example. A perfect projector will not be able to modify this example, however, an imperfect gradient descent approach does not perfectly preserve manifold points. Therefore, the authors attacked DEFENSE-GAN using BPDA, but can only get a 45% success rate.


In this paper, it was found that gradient masking is a common flaw in many defences claiming robustness against white box adversarial attacks. This leads to a perceived robustness against adversarial attacks when in reality it results in weaker adversarial image construction. The authors develop three attacks that can overcome gradient masking. With their attacks, they found that actual robustness of 7 out of the 9 defences proposed in ICLR-2018, is significantly lower. In fact, many defences were found to be completely ineffective.

Some future work that can come out of this paper includes avoiding relying on obfuscated gradients for perceived robustness and use the evaluation approach to detect when the attack occurs. Early categorization of attacks using some supervised techniques can also help in critical evaluations of incoming data.


  1. The third attack method, reparameterization of the input distortion search space was presented very briefly and at a very high level. Moreover, the one defense proposal they chose to use it against, [Samangouei, 2018] prove to be resilient against the attack. The authors had to resort to one of their other methods to circumvent the defense.
  2. The BPDA and reparameterization attacks require intrinsic knowledge of the networks. This information is not likely to be available to external users of a network. Most likely, the use-case for these attacks will be in-house to develop more robust networks. This also means that it is still possible to guard against adversarial attack using gradient masking techniques, provided the details of the network are kept secret.
    1. A notable exception to this case could be applications that are built using open-source (or even published) models that are paired with model-agnostic defense mechanisms. For example, A ResNet-50 using the model-agnostic 'input transformations' technique by [Guo, 2018] may be used in many different image classification tasks, but could still be successfully attacked using BPDA.
  3. The BPDA algorithm requires replacing a non-linear part of the model with a differentiable approximation. Since different networks are likely to use different transformations, this technique is not plug-and-play. For each network, the attack needs to be manually constructed.
  4. In general, the research field of adversarial attack would benefit from having an all-encompassing benchmark or dataset, so that the various approaches can be objectively compared and evaluated.

Other Sources

  1. Their re-implementation of each of the defenses and implementations of the attacks are available here.


  1. [Madry, 2018] Madry, A., Makelov, A., Schmidt, L., Tsipras, D. and Vladu, A., 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
  2. [Buckman, 2018] Buckman, J., Roy, A., Raffel, C. and Goodfellow, I., 2018. Thermometer encoding: One hot way to resist adversarial examples.
  3. [Guo, 2018] Guo, C., Rana, M., Cisse, M. and van der Maaten, L., 2017. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.
  4. [Xie, 2018] Xie, C., Wang, J., Zhang, Z., Ren, Z. and Yuille, A., 2017. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991.
  5. [song, 2018] Song, Y., Kim, T., Nowozin, S., Ermon, S. and Kushman, N., 2017. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766.
  6. [Szegedy, 2013] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. and Fergus, R., 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
  7. [Samangouei, 2018] Samangouei, P., Kabkab, M. and Chellappa, R., 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605.
  8. [van den Oord, 2016] van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O. and Graves, A., 2016. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems (pp. 4790-4798).
  9. [Athalye, 2017] Athalye, A. and Sutskever, I., 2017. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397.
  10. [Ma, 2018] Ma, Xingjun, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Michael E. Houle, Grant Schoenebeck, Dawn Song, and James Bailey. "Characterizing adversarial subspaces using local intrinsic dimensionality." arXiv preprint arXiv:1801.02613 (2018).
  11. [Na, 2018] Na, T., Ko, J.H. and Mukhopadhyay, S., 2017. Cascade Adversarial Machine Learning Regularized with a Unified Embedding. arXiv preprint arXiv:1708.02582.
  12. [Papernot et al., 2017] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, pp. 506–519, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4944-4.
  13. [Tramer et al., 2018] Tramer, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. Ensemble adversarial training: Attacks and defenses. International Conference on Learning Representations, 2018.