Tags: r, keras, tensorflow2.0, word-embedding

Embedding Layer in Keras: Vocab Size +1


In a number of examples I have seen, when text_tokenizer from keras is used, the input dimension of the embedding layer is set to vocab size + 1. This naturally yields an embedding matrix with one extra 'row'.

For example, I fit a simple model to estimate the embedding vectors for a vocabulary of size 3: "I like turtles". Each word in the vocabulary gets an embedding vector of length 5.

The embedding weights are:

 0.01209533   0.034303080  -0.04666784   0.02803965  -0.03691160
-0.01302978  -0.030584216  -0.02506201   0.04771456   0.01906699
 0.02800793   0.042204402   0.05223191  -0.01184921   0.02000498
 0.02692273  -0.008792922   0.01560913  -0.02783649   0.02692282

My question: I assume that the first "row" in our matrix is the vector for index 0, so that rows 2, 3, and 4 are associated with "I", "like", and "turtles" respectively.

Is this the case? I want to ensure that I align my vocabulary properly, but I haven't been able to pin down any documentation to confirm this assumption.


Solution

  • I understand that you want to extract the embedding for each word, but I think the real question is: what output is the tokenizer producing?

    Also, that tokenizer is a bit of a mess. You'll see what I mean below.

    Because the tokenizer will filter words (assuming a non-trivial vocabulary), I don't want to assume that the words are stored in the order in which they are found. So here I programmatically determine the vocabulary using word_index. I then explicitly check which words are tokenized after filtering for the most frequently used words. (word_index remembers all words, i.e. the pre-filtered values.)

    from tensorflow.keras.preprocessing.text import Tokenizer

    corpus = 'I like turtles'
    num_words = len(corpus.split())
    oov = 'OOV'
    # +2 rather than +1: index 0 is reserved and the OOV token takes index 1.
    tokenizer = Tokenizer(num_words=num_words + 2, oov_token=oov)
    # Each word is passed to the tokenizer as its own "text".
    tokenizer.fit_on_texts(corpus.split())
    print(f'word_index: {tokenizer.word_index}')
    print(f'vocabulary: {tokenizer.word_index.keys()}')

    # Encode every key of word_index to see which indices survive filtering.
    text = list(tokenizer.word_index.keys())
    print(f'keys: {text}: {tokenizer.texts_to_sequences(text)}')

    # Known words only.
    text = 'I like turtles'.split()
    print(f'{text}: {tokenizer.texts_to_sequences(text)}')

    # 'marshmallows' was never seen, so it is encoded as the OOV index.
    text = 'I like marshmallows'.split()
    print(f'{text}: {tokenizer.texts_to_sequences(text)}')

    This produces the following output:

    word_index: {'OOV': 1, 'i': 2, 'like': 3, 'turtles': 4}
    vocabulary: dict_keys(['OOV', 'i', 'like', 'turtles'])
    keys: ['OOV', 'i', 'like', 'turtles']: [[1], [2], [3], [4]]
    ['I', 'like', 'turtles']: [[2], [3], [4]]
    ['I', 'like', 'marshmallows']: [[2], [3], [1]]
    

    Note that because oov_token is specified, 'OOV' takes index 1 in word_index and every real word is shifted up by one:

    {'OOV': 1, 'i': 2, 'like': 3, 'turtles': 4}
    

    Notice how I had to specify num_words=num_words + 2 instead of the expected '+1'. That's because we're explicitly defining an OOV token, which gets added to the vocabulary and takes up an index of its own. A bit nuts imo.
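
One way to sidestep the +1/+2 bookkeeping is to derive the size from word_index itself. This is a minimal sketch in plain Python, with word_index copied from the output above; the names input_dim and num_real_words are illustrative only.

```python
# word_index as printed above (OOV token included at index 1).
word_index = {'OOV': 1, 'i': 2, 'like': 3, 'turtles': 4}

# Index 0 is never assigned to a word, so the number of distinct
# indices (and hence embedding rows) is the largest index plus one.
input_dim = max(word_index.values()) + 1

num_real_words = len(word_index) - 1  # exclude the OOV entry
print(input_dim)                      # 5
print(input_dim - num_real_words)     # 2, i.e. the "+2" used for num_words
```

For this example the same value, 5, covers indices 0 through 4, so it should work both as num_words and as the embedding layer's input dimension.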

    If you specify an OOV token and you set num_words=num_words + 1 (as documented), then 'I like turtles' gets the same encoding as 'I like marshmallows'. Also nuts.
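
The collapse can be reproduced without Keras. The function below is a simplified, hypothetical model of the cutoff (encode is not a Keras function): any index greater than or equal to num_words is replaced by the OOV index, as is any unknown word. Treat it as a sketch of the behaviour seen in the output above, not as the library's implementation. It also returns one flat list per sentence instead of one list per word.

```python
def encode(words, word_index, num_words, oov_token='OOV'):
    """Hypothetical stand-in for the tokenizer's filtering step."""
    oov_index = word_index[oov_token]
    encoded = []
    for word in words:
        index = word_index.get(word.lower())
        # Unknown words and indices at or beyond the cutoff become OOV.
        if index is None or index >= num_words:
            index = oov_index
        encoded.append(index)
    return encoded


word_index = {'OOV': 1, 'i': 2, 'like': 3, 'turtles': 4}
turtles = 'I like turtles'.split()
marshmallows = 'I like marshmallows'.split()

# Three words + 1: 'turtles' (index 4) falls outside the cutoff.
print(encode(turtles, word_index, num_words=4))       # [2, 3, 1]
print(encode(marshmallows, word_index, num_words=4))  # [2, 3, 1]

# Three words + 2: 'turtles' keeps its own index.
print(encode(turtles, word_index, num_words=5))       # [2, 3, 4]
print(encode(marshmallows, word_index, num_words=5))  # [2, 3, 1]
```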

    Hopefully, you now have the tools to see what the tokenizer is feeding the embedding layer. From there, it should be straightforward to correlate the tokens with their embeddings.
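
As a sketch of that last step, the snippet below pairs each word with a row of the weight matrix from the question. It assumes the question's tokenizer was fit without an OOV token, so that word_index is {'i': 1, 'like': 2, 'turtles': 3} and index 0 is the unused first row. That word_index is an assumption, not something shown in the question, and embedding_for is just an illustrative name.

```python
# Embedding weights copied from the question (4 rows x 5 columns).
weights = [
    [0.01209533, 0.034303080, -0.04666784, 0.02803965, -0.03691160],
    [-0.01302978, -0.030584216, -0.02506201, 0.04771456, 0.01906699],
    [0.02800793, 0.042204402, 0.05223191, -0.01184921, 0.02000498],
    [0.02692273, -0.008792922, 0.01560913, -0.02783649, 0.02692282],
]

# ASSUMPTION: no OOV token, so the words occupy indices 1..3.
word_index = {'i': 1, 'like': 2, 'turtles': 3}

# Token index k selects row k (0-based), so row 0 belongs to no word.
embedding_for = {word: weights[index] for word, index in word_index.items()}

for word, vector in embedding_for.items():
    print(word, vector)
```

Python counts rows from 0. In R's 1-based indexing the same rows are 2, 3, and 4, which matches the assumption in the question as long as no OOV token is in play.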

    Please let us know what you find. :)

    (For more on the madness, check out this StackOverflow post.)