In the toy example below, my vocabulary size is 7 and my embedding size is 8, but the weight matrix returned by the Keras Embedding layer is 8x8. How is that? This seems connected to other questions about the Embedding layer's input_dim having to be "maximum integer index + 1". I've read the other Stack Overflow posts on this, but they suggest it is not vocab_size + 1, while my code tells me it is. I'm asking because I need to know exactly which embedding vector relates to which word.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work']
labels = np.array([1, 1, 1, 1])
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
encoded_docs = tokenizer.texts_to_sequences(docs)
max_seq_len = max(len(x) for x in encoded_docs)  # max len is 2
padded_seq = pad_sequences(sequences=encoded_docs, maxlen=max_seq_len, padding='post')
embedding_size = 8
print(tokenizer.index_word)
# {1: 'work', 2: 'well', 3: 'done', 4: 'good', 5: 'great', 6: 'effort', 7: 'nice'}
print(len(tokenizer.index_word))  # 7
vocab_size = len(tokenizer.index_word)+1
model = Sequential()
model.add(Embedding(input_dim=vocab_size,output_dim=embedding_size,input_length=max_seq_len, name='embedding_lay'))
model.add(Flatten())
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['acc'])
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_lay (Embedding)    (None, 2, 8)              64
_________________________________________________________________
flatten_1 (Flatten)          (None, 16)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17
=================================================================
Total params: 81
Trainable params: 81
Non-trainable params: 0
model.fit(padded_seq,labels, verbose=1,epochs=20)
model.get_layer('embedding_lay').get_weights()
[array([[-0.0389936 , -0.0294274 , 0.02361362, 0.01885288, -0.01246006,
-0.01004354, 0.01321061, -0.02298149],
[-0.01264734, -0.02058442, 0.0114141 , -0.02725944, -0.06267354,
0.05148344, -0.02335678, -0.06039589],
[ 0.0582506 , 0.00020944, -0.04691287, 0.02985037, 0.02437406,
-0.02782 , 0.00378997, 0.01849808],
[-0.01667434, -0.00078654, -0.04029636, -0.04981862, 0.01762467,
0.06667487, 0.00302309, 0.02881355],
[ 0.04509508, -0.01994639, 0.01837089, -0.00047283, 0.01141069,
-0.06225454, 0.01198813, 0.02102971],
[ 0.05014603, 0.04591557, -0.03119368, 0.04181939, 0.02837115,
-0.01640332, 0.0577693 , 0.01364574],
[ 0.01948108, -0.04200416, -0.06589368, -0.05397511, 0.02729052,
0.04164972, -0.03795817, -0.06763416],
[ 0.01284658, 0.05563928, -0.026766 , 0.03231764, -0.0441488 ,
-0.02879154, 0.02092744, 0.01947528]], dtype=float32)]
So how do I get the vectors for my 7 words, e.g. for {1: 'work'...}, out of this 8-row matrix, and what does the 8th row mean? If I set vocab_size = len(tokenizer.index_word) instead (without the +1), I get shape errors when trying to fit the model.
The Embedding layer uses tf.nn.embedding_lookup under the hood, which is zero-based by default. For example:
import tensorflow as tf
import numpy as np
docs = ['Well done!',
'Good work',
'Great effort',
'nice work']
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(docs)
encoded_docs = tokenizer.texts_to_sequences(docs)
max_seq_len = max(len(x) for x in encoded_docs) # max len is 2
padded_seq = tf.keras.preprocessing.sequence.pad_sequences(sequences=encoded_docs,maxlen=max_seq_len,padding='post')
embedding_size = 8
tf.random.set_seed(111)
# Create integer embeddings for demonstration purposes.
embeddings = tf.cast(tf.random.uniform((7, embedding_size), minval=10, maxval=20, dtype=tf.int32), dtype=tf.float32)
print(padded_seq)
tf.nn.embedding_lookup(embeddings, padded_seq)
[[2 3]
[4 1]
[5 6]
[7 1]]
<tf.Tensor: shape=(4, 2, 8), dtype=float32, numpy=
array([[[17., 11., 10., 16., 17., 16., 16., 17.],
[18., 15., 13., 13., 18., 18., 10., 16.]],
[[17., 16., 13., 12., 13., 15., 19., 14.],
[12., 15., 12., 15., 10., 19., 15., 12.]],
[[18., 15., 11., 13., 13., 13., 16., 10.],
[18., 18., 11., 12., 10., 13., 14., 10.]],
--> [[ 0., 0., 0., 0., 0., 0., 0., 0.] <--,
[12., 15., 12., 15., 10., 19., 15., 12.]]], dtype=float32)>
Notice how the integer 7 is mapped to a zero vector: with a table of 7 rows, tf.nn.embedding_lookup only knows how to map the indices 0 to 6. That is the reason you should use vocab_size = len(tokenizer.index_word) + 1, since you want a meaningful vector for the integer 7:
embeddings = tf.cast(tf.random.uniform((8, embedding_size), minval=10, maxval=20, dtype=tf.int32), dtype=tf.float32)
tf.nn.embedding_lookup(embeddings, padded_seq)
Index 0 can then be reserved for padding or unknown tokens, since your vocabulary indices start from 1.
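To get the vector for a particular word from your trained layer: row i of the weight matrix is the vector for the word with integer index i in tokenizer.index_word, and row 0 is the extra reserved row that no word maps to. A minimal sketch, reusing the model and tokenizer defined in your question:

# Embedding weight matrix, shape (vocab_size, embedding_size) == (8, 8)
weights = model.get_layer('embedding_lay').get_weights()[0]

# Row i holds the vector for the word with index i (indices 1..7).
for idx, word in tokenizer.index_word.items():
    print(word, weights[idx])

# Row 0 is the reserved row (padding / unknown); no word maps to it.
print('reserved row 0:', weights[0])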