As part of my thesis, I am trying to build a recurrent Neural Network Language Model.
From theory, I know that the input layer should be a one-hot vector layer with a number of neurons equal to the number of words of our Vocabulary, followed by an Embedding layer, which, in Keras, it apparently translates to a single Embedding layer in a Sequential model. I also know that the output layer should also be the size of our vocabulary so that each output value maps 1-1 to each vocabulary word.
However, in both the Keras documentation for the Embedding layer (https://keras.io/layers/embeddings/) and in this article (https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/#comment-533252), the vocabulary size is arbitrarily augmented by one for both the input and the output layers! Jason gives an explenation that this is due to the implementation of the Embedding layer in Keras but that doesn't explain why we would also use +1 neuron in the output layer. I am at the point of wanting to order the possible next words based on their probabilities and I have one probability too many that I do not know to which word to map it too.
Does anyone know what is the correct way of acheiving the desired result? Did Jason just forget to subtrack one from the output layer and the Embedding layer just needs a +1 for implementation reasons (I mean it's stated in the official API)?
Any help on the subject would be appreciated (why is Keras API documentation so laconic?).
Edit:
This post Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2? made me think that Jason does in fact have it wrong and that the size of the Vocabulary should not be incremented by one when our word indices are: 0, 1, ..., n-1
.
However, when using Keras's Tokenizer our word indices are: 1, 2, ..., n
. In this case, the correct approach is to:
Set mask_zero=True
, to treat 0 differently, as there is never a
0 (integer) index input in the Embedding layer and keep the
vocabulary size the same as the number of vocabulary words (n
)?
Set mask_zero=True
but augment the vocabulary size by one?
Not set mask_zero=True
and keep the vocabulary size the same as the
number of vocabulary words?
the reason why we add +1 leads to the possibility that we can encounter a chance to see an unseen word(out of our vocabulary) during testing or in production, it is common to consider a generic term for those UNKNOWN and that is why we add a OOV
word in front which resembles all out of vocabulary words.
Check this issue on github which explains it in detail:
https://github.com/keras-team/keras/issues/3110#issuecomment-345153450