python tensorflow conv-neural-network text-classification

TensorFlow - Understanding filter and stride shapes for CNN text classification

I am reviewing Denny Britz's tutorial on text classification using CNNs in TensorFlow. Filter and stride shapes make perfect sense in the image domain. However, when it comes to text, I am confused about how to correctly define the stride and filter shapes. Consider the following two layers from Denny's code:

# Create a convolution + maxpool layer for each filter size
pooled_outputs = []
for i, filter_size in enumerate(filter_sizes):
    with tf.name_scope("conv-maxpool-%s" % filter_size):
        # Convolution Layer
        filter_shape = [filter_size, embedding_size, 1, num_filters]
        W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
        conv = tf.nn.conv2d(
            self.embedded_chars_expanded,
            W,
            strides=[1, 1, 1, 1],
            padding="VALID",
            name="conv")
        # Apply nonlinearity
        h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
        # Maxpooling over the outputs
        pooled = tf.nn.max_pool(
            h,
            ksize=[1, sequence_length - filter_size + 1, 1, 1],
            strides=[1, 1, 1, 1],
            padding='VALID',
            name="pool")
        pooled_outputs.append(pooled)

The shape of self.embedded_chars_expanded is [batch_size, max_sentence_length, embedding_size, 1], which means each batch member is a single-channel matrix of size max_sentence_length x embedding_size.

Filters

Suppose my filter_shape is [3, 50, 1, 16]. I interpret this as follows: the filter slides over 3 word vectors at a time, each of dimensionality 50. This is text, so the 1 corresponds to a single input channel (as opposed to 3 for RGB). Lastly, the 16 means I will have 16 filters in the conv layer.

Have I interpreted this correctly?

Strides

Similarly, the stride shapes in both the conv and pooling layers are defined as [1, 1, 1, 1].

Do this shape's dimensions correspond to the dimensions of the filter_shape?

If so, this is why I am confused. It would seem that the nature of word vector representations means the stride should be [1, embedding_size, 1, 1], meaning I want to move the window one full word at a time over one channel for each filter.

Solution

Filters

Have I interpreted this correctly?

Yes, exactly.
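The shape bookkeeping behind this interpretation can be checked with a small sketch in plain Python. It assumes TensorFlow's VALID-padding rule, output = ceil((input - window + 1) / stride), and the default NHWC layout. The helper names and the concrete sizes (batch of 64, sentences of 56 words) are hypothetical and chosen only for illustration; they are not taken from the tutorial.

```python
import math

def valid_output_size(input_size, window_size, stride):
    """Output length along one axis with VALID padding (no padding added)."""
    return math.ceil((input_size - window_size + 1) / stride)

def conv2d_valid_shape(input_shape, filter_shape, strides):
    """Output shape for an NHWC input and a
    [height, width, in_channels, out_channels] filter."""
    batch, in_h, in_w, in_c = input_shape
    f_h, f_w, f_in, f_out = filter_shape
    assert in_c == f_in, "filter in_channels must match input channels"
    out_h = valid_output_size(in_h, f_h, strides[1])
    out_w = valid_output_size(in_w, f_w, strides[2])
    return [batch, out_h, out_w, f_out]

def max_pool_valid_shape(input_shape, ksize, strides):
    """Output shape of VALID max pooling over an NHWC input."""
    batch, in_h, in_w, in_c = input_shape
    out_h = valid_output_size(in_h, ksize[1], strides[1])
    out_w = valid_output_size(in_w, ksize[2], strides[2])
    return [batch, out_h, out_w, in_c]

# Hypothetical sizes, for illustration only.
batch_size, sequence_length, embedding_size = 64, 56, 50
filter_size, num_filters = 3, 16

conv_shape = conv2d_valid_shape(
    [batch_size, sequence_length, embedding_size, 1],
    [filter_size, embedding_size, 1, num_filters],
    [1, 1, 1, 1])
pooled_shape = max_pool_valid_shape(
    conv_shape,
    [1, sequence_length - filter_size + 1, 1, 1],
    [1, 1, 1, 1])

print(conv_shape)    # [64, 54, 1, 16]
print(pooled_shape)  # [64, 1, 1, 16]
```

Each of the 16 filters produces one value per 3-word window (54 windows here), and the pooling layer then keeps the maximum over all windows, leaving one value per filter.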

Strides

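Do this shape's dimensions correspond to the dimensions of the filter_shape?

They correspond to the dimensions of the input rather than of the filter_shape. The strides are the step sizes by which the filter is moved over the input, so with the default NHWC layout they are ordered [batch, height, width, channels], which here means [batch, word position, embedding dimension, channel]. The filter_shape is ordered differently: [height, width, in_channels, out_channels].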

It would seem that the nature of word vector representations means the stride should be [1, embedding_size, 1, 1], meaning I want to move the window one full word at a time over one channel for each filter.

Pay attention to the padding strategy: the padding in conv2d is set to VALID, which means no padding is added. Since the filter spans the entire embedding dimension, it fits in exactly one position along that dimension, whatever the stride along it.

Put differently, the filter can be placed along the embedding dimension only once, independently of the stride.
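A minimal sketch of this argument, again in plain Python rather than TensorFlow, assuming the VALID rule output = ceil((input - window + 1) / stride) and the default NHWC stride order [batch, height, width, channels]. The helper name and the sizes are hypothetical. It also shows what strides=[1, embedding_size, 1, 1] would do under that ordering: the second entry steps along the sentence (height) axis, so it would skip most word positions rather than move one word at a time.

```python
import math

def valid_output_size(input_size, window_size, stride):
    """Output length along one axis with VALID padding (no padding added)."""
    return math.ceil((input_size - window_size + 1) / stride)

# Hypothetical sizes, for illustration only.
sequence_length, embedding_size, filter_size = 56, 50, 3

# The filter is as wide as the embedding, so it fits exactly once along
# that axis regardless of the stride used there.
widths = [valid_output_size(embedding_size, embedding_size, s)
          for s in (1, 2, embedding_size)]
print(widths)  # [1, 1, 1]

# strides=[1, 1, 1, 1]: one step per word along the sentence axis.
height_default = valid_output_size(sequence_length, filter_size, 1)
# strides=[1, embedding_size, 1, 1]: embedding_size lands in the height
# slot, so the filter jumps 50 word positions at a time.
height_strided = valid_output_size(sequence_length, filter_size, embedding_size)
print(height_default, height_strided)  # 54 2
```

So [1, 1, 1, 1] already moves the window one full word at a time; the step of 1 along the embedding axis is never actually taken.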