In the CNN literature, it is often illustrated that the kernel size is the same as the size of the longest word in the vocabulary as it sweeps across a sentence.
So if we use embeddings to represent the text, shouldn't the kernel size be the same as the embedding dimension, so that it has the same effect as sweeping word by word?
Yet I see different kernel sizes used, regardless of the word length.
Well... these are 1D convolutions, for which the kernels are 3-dimensional.
It's true that one of these 3 dimensions must match the embedding size (otherwise it would be pointless to have this dimension).
These three dimensions are:
(length_or_size, input_channels, output_channels)
Where:
- length_or_size (kernel_size): anything you want. In the picture, there are 6 different filters with sizes 4, 4, 3, 3, 2, 2, represented by the "vertical" dimension.
- input_channels (automatically the embedding_size): the size of the embedding. This is somewhat mandatory (in Keras this is automatic and almost invisible); otherwise the multiplications wouldn't use the entire embedding, which would be pointless. In the picture, the "horizontal" dimension of the filters is constantly 5 (the same as the embedding size - this is not a spatial dimension).
- output_channels (filters): anything you want, but the picture seems to be talking about only 1 channel per filter, since this dimension is totally ignored; if represented, it would be something like "depth".

So, you're probably confusing which dimensions are which. When you define a conv layer, you do:
Conv1D(filters=output_channels, kernel_size=length_or_size)
While the input_channels come from the embedding (or the previous layer) automatically.
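As a quick sanity check (a minimal sketch, assuming the tensorflow.keras backend), you can build a standalone Conv1D layer on a 5-dimensional input and inspect its kernel, which has exactly the (length_or_size, input_channels, output_channels) layout described above:

import tensorflow as tf

conv = tf.keras.layers.Conv1D(filters=2, kernel_size=4)
conv.build(input_shape=(None, 7, 5))  # (batch, sentence_length, embedding_size)
print(conv.kernel.shape)              # -> (4, 5, 2)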
To create this model, it would be something like:
from tensorflow.keras.layers import (Input, Embedding, Conv1D,
                                     GlobalMaxPooling1D, Concatenate, Dense)
from tensorflow.keras.models import Model

total_words_in_dic = 10000     # vocabulary size - use whatever your data has
activation_function = 'relu'   # for example

sentence_length = 7
embedding_size = 5

inputs = Input((sentence_length,))
out = Embedding(total_words_in_dic, embedding_size)(inputs)
Now, supposing these filters have only 1 channel each (since the image doesn't seem to consider their depth), we can group them in pairs, giving three conv layers with 2 output channels each:
size1 = 4
size2 = 3
size3 = 2
output_channels=2
out1 = Conv1D(output_channels, size1, activation=activation_function)(out)
out2 = Conv1D(output_channels, size2, activation=activation_function)(out)
out3 = Conv1D(output_channels, size3, activation=activation_function)(out)
Now, let's collapse the spatial dimension and keep just the two channels:
out1 = GlobalMaxPooling1D()(out1)
out2 = GlobalMaxPooling1D()(out2)
out3 = GlobalMaxPooling1D()(out3)
And create the 6-channel output:
out = Concatenate()([out1,out2,out3])
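At this point each pooled branch has shape (None, 2), so the concatenation is (None, 6). You can verify this directly (assuming TensorFlow 2, where the symbolic tensors expose .shape):

print(out1.shape)  # (None, 2)
print(out.shape)   # (None, 6)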
Now there is a mystery jump from 6 channels to 2 channels which the picture cannot explain. Perhaps they're applying a Dense layer or something similar:

# assumption: a Dense layer maps the 6 pooled channels down to the 2 outputs
out = Dense(2, activation='softmax')(out)
model = Model(inputs, out)
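To confirm everything fits together, you can inspect the summary and push a dummy batch through the model (a sketch; the vocabulary size and activation are the placeholder values defined earlier):

import numpy as np

model.summary()  # Concatenate outputs (None, 6), Dense outputs (None, 2)

dummy_batch = np.random.randint(0, total_words_in_dic, size=(3, sentence_length))
print(model.predict(dummy_batch).shape)  # -> (3, 2)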