I am reviewing Denny Britz's tutorial on text classification using CNNs in TensorFlow. Filter and stride shapes make perfect sense in the image domain. However, when it comes to text, I am confused about how to correctly define the stride and filter shapes. Consider the following two layers from Denny's code:
# Create a convolution + maxpool layer for each filter size
pooled_outputs = []
for i, filter_size in enumerate(filter_sizes):
    with tf.name_scope("conv-maxpool-%s" % filter_size):
        # Convolution Layer
        filter_shape = [filter_size, embedding_size, 1, num_filters]
        W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
        conv = tf.nn.conv2d(
            self.embedded_chars_expanded,
            W,
            strides=[1, 1, 1, 1],
            padding="VALID",
            name="conv")
        # Apply nonlinearity
        h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
        # Maxpooling over the outputs
        pooled = tf.nn.max_pool(
            h,
            ksize=[1, sequence_length - filter_size + 1, 1, 1],
            strides=[1, 1, 1, 1],
            padding='VALID',
            name="pool")
        pooled_outputs.append(pooled)
The shape of self.embedded_chars_expanded is [batch_size, max_sentence_length, embedding_size, 1], which means each batch member is a single-channel matrix of max_sentence_length x embedding_size.
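To make those shapes concrete for myself, here is a minimal standalone check with dummy data and made-up sizes (not the tutorial's actual values), assuming the trailing 1 is just a channel dimension added with tf.expand_dims:

import numpy as np
import tensorflow as tf

batch_size, max_sentence_length, embedding_size = 8, 56, 50  # made-up sizes

# Dummy stand-in for the embedded input: [batch, sentence length, embedding dim]
embedded_chars = np.random.randn(
    batch_size, max_sentence_length, embedding_size).astype(np.float32)

# Adding a trailing "channel" dimension gives the 4-D NHWC shape conv2d expects
embedded_chars_expanded = tf.expand_dims(embedded_chars, -1)
print(embedded_chars_expanded.shape)  # (8, 56, 50, 1)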
Filters
Suppose my filter_shape is [3, 50, 1, 16]. I interpret this as: the filter will slide over 3 word vectors at a time, each of which has dimensionality 50. This is text, so the 1 corresponds to a single channel of input (as opposed to 3 with RGB). Lastly, the 16 means I will have 16 filters in the conv layer.
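To sanity-check that reading I put together a small standalone snippet (random data, made-up sizes, not the tutorial's code). The conv output has one row per 3-word window, a width of 1 because the filter spans the whole embedding, and 16 feature maps; the max-pool then collapses all window positions into a single value per filter:

import numpy as np
import tensorflow as tf

batch_size, max_sentence_length, embedding_size = 8, 56, 50  # made-up sizes
filter_size, num_filters = 3, 16

x = np.random.randn(batch_size, max_sentence_length, embedding_size, 1).astype(np.float32)
W = np.random.randn(filter_size, embedding_size, 1, num_filters).astype(np.float32)

conv = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding="VALID")
print(conv.shape)  # (8, 54, 1, 16): 56 - 3 + 1 = 54 window positions, 16 filters

pooled = tf.nn.max_pool(conv,
                        ksize=[1, max_sentence_length - filter_size + 1, 1, 1],
                        strides=[1, 1, 1, 1],
                        padding="VALID")
print(pooled.shape)  # (8, 1, 1, 16): one max value per filter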
Have I interpreted this correctly?
Strides
Similarly, the stride shapes in both the conv and pooling layers are defined as [1, 1, 1, 1].
Do this shape's dimensions correspond to the dimensions of the filter_shape?
If so, this is why I am confused. It would seem that the nature of word vector representations means that the stride length should be [1, embedding_size, 1, 1], meaning I want to move the window one full word at a time over one channel for each filter.
Filters
Have I interpreted this correctly?
Yes, exactly.
Strides
Do this shape's dimensions correspond to the dimensions of the filter_shape?
Yes, in the sense that each entry gives the stride along the corresponding dimension of the input, which here is laid out as [batch, sequence_length, embedding_size, channels]; it controls how the filter is moved over the input embedding.
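To spell out what each entry controls (this annotation is mine, not from the tutorial; with the default NHWC data format the strides follow the layout of the input tensor, not of filter_shape):

strides = [
    1,  # batch: never skip examples
    1,  # sequence dimension: slide the filter one word position at a time
    1,  # embedding dimension: irrelevant here, see below
    1,  # channels: there is only a single input channel
]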
It would seem that the nature of word vector representations means that the stride length should be [1, embedding_size, 1, 1], meaning I want to move the window one full word at a time over one channel for each filter.
Pay attention to the padding strategy: the padding in conv2d is set to VALID, which means there is no padding at all. Since the filter spans the embedding dimension entirely, it can fit only once along that dimension, no matter what stride you choose there.
Put differently: you can convolve along the embedding dimension only once, independently of the stride.
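A quick way to convince yourself (a standalone sketch with random data and made-up sizes, not code from the tutorial): since the filter already spans the full embedding width, any stride along that dimension yields exactly the same output shape.

import numpy as np
import tensorflow as tf

batch_size, max_sentence_length, embedding_size = 8, 56, 50  # made-up sizes
filter_size, num_filters = 3, 16

x = np.random.randn(batch_size, max_sentence_length, embedding_size, 1).astype(np.float32)
W = np.random.randn(filter_size, embedding_size, 1, num_filters).astype(np.float32)

for embedding_stride in (1, embedding_size):
    conv = tf.nn.conv2d(x, W, strides=[1, 1, embedding_stride, 1], padding="VALID")
    print(embedding_stride, conv.shape)
# 1  (8, 54, 1, 16)
# 50 (8, 54, 1, 16)  -> the filter fits only once along the embedding dimension either way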