Tags: python, tensorflow, keras, gradient-descent, multiclass-classification

Doubts about CleverHans FastGradientMethod (FGM) adversarial image generation


I have a Keras model (a CNN with a final softmax) that is an RGB image classifier. The output of the model is one of 5 possible categories for the input image (one-hot encoded). I'm trying to generate adversarial images for my Keras model with CleverHans (the TensorFlow library).
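
For context, "model" in the snippet below is along these lines (a simplified stand-in just to fix the shapes; not my actual architecture):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Stand-in classifier: 48x48 RGB input, softmax over 5 one-hot-encoded classes.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(48, 48, 3)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(5, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')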

A simplified version of my code which generates one adversarial image is the following:

import tensorflow as tf
from cleverhans.attacks import FastGradientMethod
from cleverhans.utils_keras import KerasModelWrapper

# model is the CNN Keras model, session is the active tf.Session
wrap = KerasModelWrapper(model)
fgsm = FastGradientMethod(wrap, sess=session)
fgsm_params = {'eps': 16. / 256,
               'clip_min': 0.,
               'clip_max': 1.
               }
x = tf.placeholder(tf.float32, shape=(None, img_rows, img_cols,
                                      nchannels))
adv_x = fgsm.generate(x, **fgsm_params)
# original_image is an array containing a single RGB image, shape=(1, 48, 48, 3)
adv_image = adv_x.eval(session=session, feed_dict={x: original_image})

Chapter 1, eps

From my understanding, the 'eps' FGM parameter is the input variation step (the minimum change applied to each image value/pixel).

I have observed that the final outcome is highly affected by eps: sometimes I need a high eps in order to obtain an effective adversarial image, i.e., one that actually changes the predicted category with respect to the original image.

With a low eps, FGM sometimes fails to produce a working adversarial image: given an image O with label lO, it fails to produce an adversarial image O' with lO' != lO. For example, for lO = [0,0,1,0,0] we still obtain lO' = [0,0,1,0,0], i.e., no adversarial image with a different label is generated.

Questions (I'm sorry, the problem requires a set of questions):

  • Does FGM always find a working adversarial image? I.e., is it normal that FGM fails?
  • Is there a way to estimate the quality of the generated adversarial image (without running the model's prediction)?
  • Why is the value of the eps step so important?
  • Most important: is there a way to tell FGM to try harder to find an adversarial image (e.g., with more steps)?

Chapter 2, y, y_target

I have also experimented with the y and y_target params. Can you also explain what the 'y' and 'y_target' params are?

I thought 'y_target' indicates that we want to generate an adversarial image that targets a specific category. For example, I thought that y_target = [[0,1,0,0,0]] in feed_dict should force the generation of an adversarial image which is classified as the 2nd class by the model.

  • Am I right, or
  • am I missing something?

P.S.: my problem is that setting y_target fails to produce adversarial images.
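
For reference, this is roughly how I pass y_target (building on the snippet above; the placeholder shape and the variable names are my own guesses at the intended usage):

y_target = tf.placeholder(tf.float32, shape=(None, 5))
fgsm_params_targeted = {'eps': 16. / 256,
                        'clip_min': 0.,
                        'clip_max': 1.,
                        'y_target': y_target}
adv_x_targeted = fgsm.generate(x, **fgsm_params_targeted)
# target the 2nd class for the single input image
adv_image_targeted = adv_x_targeted.eval(session=session,
                                         feed_dict={x: original_image,
                                                    y_target: [[0, 1, 0, 0, 0]]})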

Please give me a few tips. ;-) Regards


Solution

  • I got the answers from the CleverHans developers on GitHub; I quote their answer here:

    Chapter 1:

    FGSM (like any attack) is not guaranteed to find an adversarial image that is misclassified by the model because it makes approximations when solving the optimization problem that defines an adversarial example.

    The attack can fail to find adversarial images for various reasons; one common reason is gradient masking. You can read about it in this blog post and in this paper, as well as this paper.

    The eps step is important because it is the magnitude of the perturbation. The attack first computes the direction in which to perturb the image (using the gradients of the model) and then takes a step of size eps in that direction. Hence, eps corresponds roughly to what one would intuitively think of as the "power" of the attack.
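
    To make that concrete, here is a minimal sketch of the untargeted FGSM update described above (my own illustration, not CleverHans code; loss is assumed to be the model's cross-entropy on the true label):

    import tensorflow as tf

    def fgsm_step(x, loss, eps, clip_min=0., clip_max=1.):
        # Perturb each pixel by eps in the direction of the sign of the loss gradient,
        # then clip back into the valid pixel range.
        grad = tf.gradients(loss, x)[0]
        adv_x = x + eps * tf.sign(grad)
        return tf.clip_by_value(adv_x, clip_min, clip_max)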

    You can find a multi-step variant of FGSM in BasicIterativeMethod.
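
    For example, a minimal sketch of the iterative variant (my own sketch, reusing wrap, session and x from the question's snippet; the parameter values are illustrative, not recommendations):

    from cleverhans.attacks import BasicIterativeMethod

    # Takes nb_iter steps of size eps_iter, keeping the total perturbation within eps.
    bim = BasicIterativeMethod(wrap, sess=session)
    bim_params = {'eps': 16. / 256,
                  'eps_iter': 2. / 256,
                  'nb_iter': 10,
                  'clip_min': 0.,
                  'clip_max': 1.}
    adv_x_iter = bim.generate(x, **bim_params)
    adv_image_iter = adv_x_iter.eval(session=session, feed_dict={x: original_image})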

    Chapter 2:

    y is used to specify labels in the case of an untargeted attack (any wrong class counts as a success for the adversary), whereas y_target is used to specify a target class in the targeted attack case (the adversary is successful only if the model misclassifies the input into the chosen class).

    It is often the case that targeted attacks require more perturbation (i.e., higher eps values in the FGSM case) than untargeted attacks.