Tags: python, tensorflow, keras, embedding, word-embedding

Why is the Keras Embedding layer's weight matrix of size vocab_size + 1?


I have the toy example below where my vocabulary size is 7 and the embedding size is 8, BUT the weight matrix of the Keras Embedding layer is 8x8. How is that? This seems to be connected to other questions about the Keras Embedding layer's input_dim being "maximum integer index + 1", and I've read the other Stack Overflow questions on this, but all of them suggest it is not vocab_size + 1, while my code tells me it is. I'm asking because I need to know exactly which embedding vector relates to which word.

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work']
labels = np.array([1,1,1,1])
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
encoded_docs = tokenizer.texts_to_sequences(docs)
max_seq_len = max(len(x) for x in encoded_docs) # max len is 2
padded_seq = pad_sequences(sequences=encoded_docs, maxlen=max_seq_len, padding='post')
embedding_size = 8
tokenizer.index_word

{1: 'work', 2: 'well', 3: 'done', 4: 'good', 5: 'great', 6: 'effort', 7: 'nice'}

len(tokenizer.index_word) # 7
vocab_size = len(tokenizer.index_word) + 1
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=max_seq_len, name='embedding_lay'))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_lay (Embedding)    (None, 2, 8)              64        
_________________________________________________________________
flatten_1 (Flatten)          (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 81
Trainable params: 81
Non-trainable params: 0

model.fit(padded_seq,labels, verbose=1,epochs=20)
model.get_layer('embedding_lay').get_weights()

[array([[-0.0389936 , -0.0294274 ,  0.02361362,  0.01885288, -0.01246006,
         -0.01004354,  0.01321061, -0.02298149],
        [-0.01264734, -0.02058442,  0.0114141 , -0.02725944, -0.06267354,
          0.05148344, -0.02335678, -0.06039589],
        [ 0.0582506 ,  0.00020944, -0.04691287,  0.02985037,  0.02437406,
         -0.02782   ,  0.00378997,  0.01849808],
        [-0.01667434, -0.00078654, -0.04029636, -0.04981862,  0.01762467,
          0.06667487,  0.00302309,  0.02881355],
        [ 0.04509508, -0.01994639,  0.01837089, -0.00047283,  0.01141069,
         -0.06225454,  0.01198813,  0.02102971],
        [ 0.05014603,  0.04591557, -0.03119368,  0.04181939,  0.02837115,
         -0.01640332,  0.0577693 ,  0.01364574],
        [ 0.01948108, -0.04200416, -0.06589368, -0.05397511,  0.02729052,
          0.04164972, -0.03795817, -0.06763416],
        [ 0.01284658,  0.05563928, -0.026766  ,  0.03231764, -0.0441488 ,
         -0.02879154,  0.02092744,  0.01947528]], dtype=float32)]

So how do I get the vectors for my 7 words (for instance for {1: 'work', ...}) out of this 8-row matrix, and what does the 8th row mean? If I set vocab_size = len(tokenizer.index_word) without the +1, I get shape errors when trying to fit the model.


Solution

  • The Embedding layer uses tf.nn.embedding_lookup under the hood, which indexes the embedding table with zero-based integer indices. For example:

    import tensorflow as tf
    import numpy as np
    
    docs = ['Well done!',
                'Good work',
                'Great effort',
                'nice work']
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(docs)
    encoded_docs = tokenizer.texts_to_sequences(docs)
    max_seq_len = max(len(x) for x in encoded_docs) # max len is 2
    padded_seq = tf.keras.preprocessing.sequence.pad_sequences(sequences=encoded_docs,maxlen=max_seq_len,padding='post')
    embedding_size = 8
    
    tf.random.set_seed(111)
    
    # Create integer embeddings for demonstration purposes.
    embeddings = tf.cast(tf.random.uniform((7, embedding_size), minval=10,  maxval=20, dtype=tf.int32), dtype=tf.float32)
    
    print(padded_seq)
    
    tf.nn.embedding_lookup(embeddings, padded_seq)
    
    [[2 3]
     [4 1]
     [5 6]
     [7 1]]
    <tf.Tensor: shape=(4, 2, 8), dtype=float32, numpy=
    array([[[17., 11., 10., 16., 17., 16., 16., 17.],
            [18., 15., 13., 13., 18., 18., 10., 16.]],
    
           [[17., 16., 13., 12., 13., 15., 19., 14.],
            [12., 15., 12., 15., 10., 19., 15., 12.]],
    
           [[18., 15., 11., 13., 13., 13., 16., 10.],
            [18., 18., 11., 12., 10., 13., 14., 10.]],
    
        --> [[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.] <--,
            [12., 15., 12., 15., 10., 19., 15., 12.]]], dtype=float32)>
    

    Notice how the integer 7 is mapped to a zero vector, because with a table of 7 rows tf.nn.embedding_lookup only knows how to map the values 0 to 6 (on a GPU an out-of-range index silently yields zeros; on a CPU it would raise an error instead). That is the reason you should use vocab_size = len(tokenizer.index_word) + 1: you want a meaningful vector for the integer 7 as well:

    # Now the table has 8 rows (indices 0-7), so the integer 7 gets its own vector.
    embeddings = tf.cast(tf.random.uniform((8, embedding_size), minval=10, maxval=20, dtype=tf.int32), dtype=tf.float32)
    
    tf.nn.embedding_lookup(embeddings, padded_seq)
    
    

    The index 0 can then be reserved for special tokens such as padding (pad_sequences fills with 0) or unknown words, since the Tokenizer's word indices start from 1.
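
    To answer the mapping question directly, here is a minimal sketch (reusing the model and tokenizer objects from the question, including the layer name 'embedding_lay' you chose): row i of the matrix returned by get_weights() is the vector for tokenizer.index_word[i], and row 0 is the reserved index that is never assigned to a word.

    # Trained embedding matrix, shape (vocab_size, embedding_size) = (8, 8)
    weights = model.get_layer('embedding_lay').get_weights()[0]
    
    # Map every word to its row in the embedding matrix.
    word_vectors = {word: weights[index] for index, word in tokenizer.index_word.items()}
    
    print(word_vectors['work'])  # row 1, because tokenizer.index_word[1] == 'work'
    print(weights[0])            # row 0: the reserved (padding) index, never used by a word
    

    If you want those padded zeros to be ignored by downstream layers, the Embedding layer also accepts mask_zero=True, which treats index 0 as a special padding value.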