Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples

Introduction

Over the past few years, neural network models have been the source of major breakthroughs in a variety of computer vision problems. However, these networks have been shown to be susceptible to adversarial attacks: small, human-imperceptible changes are made to images (that are correctly classified) which cause these models to misclassify with high confidence. These attacks pose a major threat that needs to be addressed before these systems can be deployed on a large scale, especially in safety-critical scenarios.

The seriousness of this threat has generated major interest both in designing such attacks and in defending against them. In this paper, the authors identify a common technique employed by several recently proposed defenses and design a set of attacks that can be used to overcome them. The use of this technique, obfuscating (masking) gradients, is so prevalent that 7 of the 9 defenses from ICLR 2018 that the authors examined relied on it. The authors were able to circumvent these defenses and bring the accuracy of the defended models down to below 10%.

Methodology

The paper assumes considerable familiarity with the adversarial attack literature; the section below briefly explains some key concepts.

Background

Adversarial Images Mathematically

Given an image [math]\displaystyle{ x }[/math] and a classifier [math]\displaystyle{ f(x) }[/math], an adversarial image [math]\displaystyle{ x' }[/math] satisfies two properties:

  1. [math]\displaystyle{ D(x,x') \lt \epsilon }[/math]
  2. [math]\displaystyle{ c(x') \neq c^*(x) }[/math]

Here, [math]\displaystyle{ D }[/math] is some distance metric, [math]\displaystyle{ \epsilon }[/math] is a small constant, [math]\displaystyle{ c(x') }[/math] is the output class predicted by the model, and [math]\displaystyle{ c^*(x) }[/math] is the true class of the input [math]\displaystyle{ x }[/math]. In words, the adversarial image lies within a small distance of the original image, yet the classifier assigns it the wrong class.
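For concreteness, below is a minimal sketch of these two conditions, assuming the distance metric [math]\displaystyle{ D }[/math] is the [math]\displaystyle{ L_\infty }[/math] norm and that pixel values are scaled to [math]\displaystyle{ [0,1] }[/math]. The predict function and the budget [math]\displaystyle{ \epsilon = 8/255 }[/math] are illustrative placeholders, not values taken from the paper.

<pre>
import numpy as np

def is_adversarial(x, x_adv, true_label, predict, epsilon=8 / 255):
    # Condition 1: x_adv lies within an epsilon-ball of x under the
    # L-infinity distance (one common choice for D).
    within_ball = np.max(np.abs(x_adv - x)) < epsilon
    # Condition 2: the classifier assigns x_adv a class other than the true one.
    misclassified = predict(x_adv) != true_label
    return bool(within_ball and misclassified)
</pre>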

Adversarial Attacks Terminology

  1. Adversarial attacks can be either black-box or white-box. In black-box attacks, the attacker only has access to the network's output, while white-box attackers have full access to the network, including its gradients, architecture, and weights. This makes white-box attackers much more powerful. Given access to gradients, white-box attacks use backpropagation to modify the input (as opposed to the weights of the network) with respect to the loss function.
  2. In untargeted attacks, the objective is to maximize the loss of the true class, [math]\displaystyle{ x'=x \mathbf{+} \lambda(sign(\nabla_xL(x,c^*(x)))) }[/math]. In targeted attacks, the objective is to minimize the loss for a target class [math]\displaystyle{ c^t(x) }[/math] that is not the true class, [math]\displaystyle{ x'=x \mathbf{-} \lambda(sign(\nabla_xL(x,c^t(x)))) }[/math]. Here, [math]\displaystyle{ L }[/math] is the loss function, [math]\displaystyle{ \nabla_x }[/math] is the gradient of the loss with respect to the input, and [math]\displaystyle{ \lambda }[/math] is a small step size.
  3. An attacker may be allowed a single step of backpropagation (single-step attacks) or multiple steps (iterative attacks). Iterative attacks generate more powerful adversarial images; a distance measure is typically used to bound the total perturbation they introduce.

In this paper, the authors focus on the setting that is hardest to defend against: white-box iterative attacks, both targeted and untargeted. A minimal sketch of such an iterative attack is given below.
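The sketch below (in PyTorch) shows one such iterative sign-gradient attack, following the update rules above and projecting back into an [math]\displaystyle{ L_\infty }[/math] ball of radius [math]\displaystyle{ \epsilon }[/math] after every step. The model, step size, budget, and iteration count are illustrative assumptions; the authors' actual attacks are tailored to each defense they circumvent.

<pre>
import torch
import torch.nn.functional as F

def iterative_attack(model, x, label, epsilon=8 / 255, step=2 / 255,
                     iters=10, targeted=False):
    # If targeted is False, `label` holds the true classes and the loss is
    # maximized; if True, `label` holds the target classes and it is minimized.
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), label)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            direction = -grad.sign() if targeted else grad.sign()
            x_adv = x_adv + step * direction
            # Project back into the epsilon-ball around x and the valid pixel range.
            x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
            x_adv = x_adv.clamp(0.0, 1.0)
        x_adv = x_adv.detach()
    return x_adv
</pre>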

Obfuscated Gradients

Since gradients are used to generate white-box adversarial images, many defense strategies have focused on masking the gradients so that such images cannot be constructed. The authors argue against this general approach and show that it can easily be circumvented. To emphasize their point, they examined the defense methods proposed at ICLR 2018 and found three types of gradient masking:

  1. Shattered gradients: Non-differentiable operations are introduced into the model.
  2. Stochastic gradients: A stochastic process is added to the model at test time.
  3. Vanishing/exploding gradients: Very deep computations or recurrent connections are used, so that, because of the vanishing or exploding gradient problem common in such networks, following the gradient to modify the input image becomes difficult.

To circumvent these, the authors propose:

  1. Backward Pass Differentiable Approximation (BPDA): The non-differentiable part of the network is replaced by a differentiable approximation when constructing the adversarial image. In a white-box setting, the attacker has full access to whatever transformation the defense performs and can therefore find such an approximation (a sketch is given after this list).
  2. Expectation over Transformation (EOT): Rather than moving along a single sampled gradient at every step, sample many gradients over the defense's randomness and move in the average direction. This counteracts the misdirection introduced by any individual stochastic gradient. This technique was previously proposed by the authors (a sketch is also given after this list).
  3. Reparameterization: Re-parametrize the input so that the optimization is carried out in a space where the gradients do not vanish or explode, i.e., write [math]\displaystyle{ x = h(z) }[/math] for some differentiable [math]\displaystyle{ h }[/math] and optimize over [math]\displaystyle{ z }[/math] instead of [math]\displaystyle{ x }[/math].
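Below is a minimal sketch (in PyTorch) of BPDA for the common special case where the gradient of a non-differentiable transformation [math]\displaystyle{ g }[/math] is approximated by the identity on the backward pass. The transformation g and the classifier model are illustrative placeholders, not the specific defenses attacked in the paper.

<pre>
import torch

class BPDAIdentity(torch.autograd.Function):
    """Apply a non-differentiable transform g on the forward pass, but
    approximate its gradient by the identity on the backward pass."""

    @staticmethod
    def forward(ctx, x, g):
        # g is assumed to map a tensor to a tensor of the same shape,
        # e.g. quantization or JPEG compression (non-differentiable).
        return g(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Pretend g(x) is approximately x, so the gradient passes straight through.
        # (None corresponds to the non-tensor argument g.)
        return grad_output, None

def defended_forward(model, x, g):
    # Forward pass through the defense's preprocessing plus the classifier;
    # with BPDA the attack can still backpropagate all the way to x.
    return model(BPDAIdentity.apply(x, g))
</pre>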

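Similarly, a minimal sketch of Expectation over Transformation is given below, where the attack gradient is estimated as a Monte Carlo average over samples of the defense's randomness. The sampler sample_transform is an assumed placeholder, and the sampled transformations must themselves be differentiable for this to work.

<pre>
import torch
import torch.nn.functional as F

def eot_gradient(model, x, label, sample_transform, n_samples=30):
    # Estimate E_t[ grad_x L(model(t(x)), label) ] by averaging over
    # n_samples random draws of the transformation t.
    grad_sum = torch.zeros_like(x)
    for _ in range(n_samples):
        x_req = x.clone().detach().requires_grad_(True)
        t = sample_transform()                      # one draw of the randomness
        loss = F.cross_entropy(model(t(x_req)), label)
        grad_sum += torch.autograd.grad(loss, x_req)[0]
    return grad_sum / n_samples
</pre>

The averaged gradient would then replace the single gradient in the iterative attack sketched earlier.
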
Summary Results

Detailed Results

Conclusion

Critique

References