Tags: machine-learning, neural-network, keras, deep-learning, conv-neural-network

What is the appropriate penultimate layer for Grad-CAM visualization on Inception V3?


I've been trying to visualize heatmaps for Inception V3. My understanding was that the penultimate layer should be the last convolutional layer, which would be conv2d_94 (idx 299). However, this gives very coarse maps (big regions). I tried another layer, mixed10 (idx 310), as suggested in this notebook for the issue described here, and while the regions are smaller, it still doesn't look great. Some others do seem to use conv2d_94, like here.

I understand it might indicate my model is simply not paying attention to the right things, but also conceptually I'm confused which layer should be used. What is an appropriate penultimate layer?

I'm using Keras 2.2.0 with visualize_cam from keras-vis.

from vis.visualization import visualize_cam  # keras-vis

heatmap = visualize_cam(model, layer_idx, filter_indices=classnum,
                        seed_input=preprocess_img, backprop_modifier=None)

Where layer_idx is the idx of dense_2.

I've tried not defining penultimate_layer, which according to the documentation sets the parameter to the nearest penultimate Conv or Pooling layer. This gives the same results as penultimate_layer=299.


Solution

  • Cannot say anything about your own data, but the penultimate layer of Inception V3 for Grad-CAM visualization is indeed mixed10 (idx 310), as reported in the notebook you have linked to:

    310 is concatenation before global average pooling

    Rationale: since the output of conv2d_94 (299) is connected downstream to other convolutional layers (or concatenations thereof), like mixed9_1, concatenate_2 etc., by definition it cannot be the penultimate convolutional layer; mixed10, on the other hand, has no such downstream convolutions; on the contrary, it is just one layer before the final average pooling one. That the penultimate layer should be a convolutional one, and not a pooling one, is suggested by Chollet's exposition, where for VGG he uses block5_conv3, and not block5_pool, which comes immediately afterwards (although, truth be told, even using block5_pool seems to give very similar visual results).
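    As a quick sanity check of this ordering, here is a minimal sketch using a hard-coded excerpt of the layer names mentioned above (not the loaded model, whose full layer list is much longer): the "penultimate" layer is simply the one sitting right before the global average pooling.

```python
# Toy excerpt of the tail of Inception V3's layer ordering; the names come
# from the discussion above, and the full model has many layers in between.
tail = ['conv2d_94', 'mixed9_1', 'concatenate_2', 'mixed10', 'avg_pool', 'predictions']

def penultimate(names, pool_name='avg_pool'):
    """Return the layer name immediately preceding the global average pooling."""
    return names[names.index(pool_name) - 1]

print(penultimate(tail))  # mixed10
```

    With the real model, the same check amounts to inspecting `model.layers` around index 310.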

    Let me elaborate a little, and explain the emphasis on "suggested" above...

    As many other things in current deep learning research & practice, Grad-CAM is a heuristic, not a "hard" scientific method; as such, there are recommendations & expectations on how to use it and what the results might be, but not hard rules (and "appropriate" layers). Consider the following excerpt from the original paper (end of section 2, emphasis mine):

    We expect the last convolutional layers to have the best compromise between high-level semantics and detailed spatial information, so we use these feature maps to compute Grad-CAM and Guided Grad-CAM.

    i.e. there are indeed recommendations & expectations, as I already said, but a certain experimenting & free-wheeling attitude is expected...
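    For reference, the Grad-CAM computation itself is short; here is a minimal numpy sketch on hypothetical toy data (not real Inception V3 activations): each feature map is weighted by the spatially-averaged gradient of the class score, the maps are summed, and a ReLU keeps only the positive evidence.

```python
import numpy as np

# Toy stand-ins for the chosen layer's activations and the gradients of the
# class score w.r.t. those activations (shapes H x W x C).
rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((8, 8, 4))
grads = rng.standard_normal((8, 8, 4))

pooled_grads = grads.mean(axis=(0, 1))                            # one weight per channel
cam = np.maximum((feature_maps * pooled_grads).sum(axis=-1), 0)   # weighted sum + ReLU
cam /= cam.max() + 1e-8                                           # normalize to [0, 1]
print(cam.shape)  # (8, 8)
```

    Note the spatial resolution of the map equals that of the chosen layer's output, which is why deeper layers give coarser heatmaps.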


    Now, assuming you are following Chollet's notebook on the subject (i.e. using pure Keras, and not the Keras-vis package), these are the changes in the code you need in order to make it work with Inception V3:

    # cell 24
    from keras import backend as K
    from keras.applications.inception_v3 import InceptionV3
    K.clear_session()
    K.set_learning_phase(0) # needs to be set BEFORE building the model
    model = InceptionV3(weights='imagenet')
    
    # in cell 27
    from keras.applications.inception_v3 import preprocess_input, decode_predictions
    img = image.load_img(img_path, target_size=(299, 299)) # different size than VGG
    
    # in cell 31:
    last_conv_layer = model.get_layer('mixed10')
    for i in range(2048):  # was 512 for VGG; mixed10 has 2048 channels
        conv_layer_output_value[:, :, i] *= pooled_grads_value[i]
    
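    After the weighting loop, the notebook averages over channels, clips negatives, normalizes, and resizes the map before superimposing it; a dependency-free sketch of those steps on toy data (the notebook itself uses cv2.resize, here np.kron stands in as a nearest-neighbour upscale):

```python
import numpy as np

# Toy stand-in for the channel-weighted mixed10 output (real shape is 8x8x2048).
conv_layer_output_value = np.random.rand(8, 8, 2048)

heatmap = np.mean(conv_layer_output_value, axis=-1)   # collapse the 2048 channels
heatmap = np.maximum(heatmap, 0)                      # keep positive evidence only
heatmap /= heatmap.max()                              # scale to [0, 1]
upscaled = np.kron(heatmap, np.ones((37, 37)))        # 8x8 -> 296x296, close to the 299x299 input
print(upscaled.shape)  # (296, 296)
```
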

    And the resulting superimposed heatmap on the original creative_commons_elephant.jpg image should look like this:

    [image: Grad-CAM heatmap superimposed on the elephant photo]

    which, arguably, is not that different from the respective VGG image produced in Chollet's notebook (although, admittedly, the heatmap is indeed more spread out, and it does not seem to conform to Chollet's narrative about focusing on the ears)...