I want a code example of how to generate training data from CleverHans adversarial attacks.
adv_x = fgsm.generate_np(X_test, **fgsm_params)
This generates adversarial x data, but how can I get y?
adv_pred = model.predict_classes(adv_x)
And this will give the "fooled" results, right?
What I want is to correctly show the generated x, the original y, and the fooled y (by which I mean the model's predictions, which may be wrong because of the attack). I'm using MNIST, by the way, if it helps.
Based on the code snippets you shared, I would make two suggestions:
It is generally not a good idea to train the model on test data (if you are going to use that test data to evaluate its performance afterwards), so I would replace X_test with X_train in your first line.
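For example, the corrected first line would look like this (a minimal sketch; it assumes fgsm and fgsm_params are already set up as in your snippet):

# Generate adversarial examples from the training set,
# leaving X_test untouched for later evaluation.
adv_x = fgsm.generate_np(X_train, **fgsm_params)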
To get the labels for your adversarial examples, you can either reuse the original labels of the training data, or use the model's predictions on the original training data, model.predict_classes(X_train) (this assumes that the perturbation is not large enough to change the true label of the input).
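Putting both suggestions together, a minimal sketch could look like the following. It assumes a Keras Sequential model (so that model.predict_classes is available, as in your snippet), that fgsm and fgsm_params are defined as in your code, and that Y_train holds the original MNIST training labels as integers (convert with np.argmax first if they are one-hot encoded):

import numpy as np

# 1) Generate adversarial examples from the training data, not the test data.
adv_x = fgsm.generate_np(X_train, **fgsm_params)

# 2) Labels for the adversarial examples: either reuse the original training
#    labels, or take the model's predictions on the clean inputs. Both options
#    assume the perturbation is too small to change the true class of the image.
adv_y = Y_train                            # option A: original labels (integers assumed)
# adv_y = model.predict_classes(X_train)   # option B: model's clean predictions

# 3) The "fooled" labels: what the model predicts on the perturbed inputs.
fooled_y = model.predict_classes(adv_x)

# Count how many predictions the attack actually changed.
flipped = fooled_y != adv_y
print("Attack changed the prediction on %d of %d examples" % (flipped.sum(), len(adv_y)))

That covers the three things from your question: the generated x (adv_x), its label y (adv_y), and the fooled y (fooled_y).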