Tags: python, tensorflow, word-embedding

Tensorflow Embedding Layer Vocabulary Size


I am learning TensorFlow and have come across the Embedding layer, which is used to learn your own word embeddings. The layer takes the following parameters:

keras.layers.Embedding(input_dim, 
                       output_dim, 
                       embeddings_initializer='uniform',
                       embeddings_regularizer=None, 
                       activity_regularizer=None, 
                       embeddings_constraint=None, 
                       mask_zero=False, 
                       input_length=None)

The input_dim should be the same size as the vocabulary, i.e. the number of unique words. If I wanted to limit the vocabulary to the 25000 most frequent words, how should I do this?

Can I simply change input_dim to 25000, or would I have to go through my corpus and replace any word outside the top 25000 with an <unk> token, for example?
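
For context, input_dim just sets the number of rows in the layer's weight matrix: one trainable vector per token id. A quick check in tensorflow.keras (the sizes here are illustrative):

from tensorflow.keras.layers import Embedding
import numpy as np

layer = Embedding(input_dim=25000, output_dim=128)
out = layer(np.array([[0, 1, 24999]]))  # valid token ids run from 0 to input_dim - 1
print(out.shape)                        # (1, 3, 128)
print(layer.get_weights()[0].shape)     # (25000, 128): one row per token id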


Solution

  • Actually, if you use tensorflow.keras, you have to make sure the token indices in your corpus don't exceed the vocabulary size (the input_dim of the embedding layer); otherwise you'll get an error.

    If you use keras, then you can just change the input_dim in your embedding layer without changing anything in the corpus or the tokens: keras will map out-of-vocabulary tokens to a zero vector.

    First of all, there is an error if you use tensorflow.keras.

    tensorflow.keras

    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import Embedding, Input
    import numpy as np
    
    ip = Input(shape=(3,))
    # input_dim=1, so the only valid token index is 0
    emb = Embedding(1, 2, trainable=True, mask_zero=True)(ip)
    
    model = Model(ip, emb)
    input_array = np.array([[5, 3, 1], [1, 2, 3]])  # every index is out of vocabulary
    
    model.compile("rmsprop", "mse")
    
    output_array = model.predict(input_array)  # tensorflow.keras raises an error here
    
    print(output_array)
    print(output_array.shape)
    model.summary()
    

    [Screenshot: model.predict fails with an error for the out-of-vocabulary indices]

    But if I use keras 2.3.1, I don't get any error.

    keras 2.3.1

    from keras.models import Model
    from keras.layers import Embedding, Input
    import numpy as np
    
    ip = Input(shape=(3,))
    # same layer as before: only token index 0 is in vocabulary
    emb = Embedding(1, 2, trainable=True, mask_zero=True)(ip)
    
    model = Model(ip, emb)
    input_array = np.array([[5, 3, 1], [1, 2, 3]])  # same out-of-vocabulary indices
    
    model.compile("rmsprop", "mse")
    
    output_array = model.predict(input_array)  # no error with keras 2.3.1
    
    print(output_array)
    print(output_array.shape)
    model.summary()
    

    [Screenshot: the same model runs without error and prints all-zero vectors]

    keras and tensorflow.keras have different implementations of the embedding layer. To verify that, let's look at the keras embedding layer:

    https://github.com/keras-team/keras/blob/master/keras/layers/embeddings.py#L16

    For now, let's just look at the call function.

        def call(self, inputs):
            if K.dtype(inputs) != 'int32':
                inputs = K.cast(inputs, 'int32')
            out = K.gather(self.embeddings, inputs)
            return out
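
    With the TensorFlow backend, K.gather boils down to tf.gather. Per the tf.gather documentation, an out-of-range index raises an error on CPU but writes zeros into the corresponding output on GPU, which is consistent with the zero vectors we'll see below. A minimal sketch of the lookup itself (table values are illustrative):

    import tensorflow as tf
    
    embeddings = tf.constant([[0.0368, -0.0490]])  # a 1 x 2 table: only row 0 exists
    indices = tf.constant([[0, 0, 0]])             # in-range lookups work on any device
    print(tf.gather(embeddings, indices).shape)    # (1, 3, 2)
    # an out-of-range index (e.g. 5) errors on CPU but yields a zero row on GPU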
    

    N.B.: If you want the exact source code for keras 2.3.1, download it from the releases page: https://github.com/keras-team/keras/releases

    But if we look at the tensorflow implementation, it's different.

    https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/embedding_ops.py

    Just to verify: the call function is written differently.

      def call(self, inputs):
        dtype = K.dtype(inputs)
        if dtype != 'int32' and dtype != 'int64':
          inputs = math_ops.cast(inputs, 'int32')
        out = embedding_ops.embedding_lookup(self.embeddings, inputs)
        return out
    

    Let's run the exact same keras 2.3.1 network as above and observe the output and the weight matrix.


    The model gives the following output.

    [[[0. 0.]
      [0. 0.]
      [0. 0.]]
    
     [[0. 0.]
      [0. 0.]
      [0. 0.]]]
    (2, 3, 2)
    Model: "model_18"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    input_21 (InputLayer)        (None, 3)                 0         
    _________________________________________________________________
    embedding_33 (Embedding)     (None, 3, 2)              2         
    =================================================================
    Total params: 2
    Trainable params: 2
    Non-trainable params: 0
    

    Okay, we are getting a bunch of zeros, but the default embeddings_initializer is 'uniform', not zeros!

    So, let's observe the weight matrix now.

    w = model.layers[1].get_weights()  # the embedding matrix: shape (input_dim, output_dim) = (1, 2)
    print(w)
    
    
    [array([[ 0.03680499, -0.04904002]], dtype=float32)]
    

    In fact, it is not all zeros.

    So, why are we getting zeros?

    Let's change our input to the model.

    Since the only in-vocabulary index for input_dim = 1 is 0, let's pass 0 as one of the inputs.

    from keras.models import Model
    from keras.layers import Embedding, Input
    import numpy as np
    
    ip = Input(shape=(3,))
    emb = Embedding(1, 2, trainable=True, mask_zero=True)(ip)
    
    model = Model(ip, emb)
    input_array = np.array([[5, 0, 1], [1, 2, 0]])  # each row now contains the in-vocabulary index 0
    
    model.compile("rmsprop", "mse")
    
    output_array = model.predict(input_array)
    print(output_array)
    print(output_array.shape)
    model.summary()
    

    Now, we get non-zero vectors for the positions where we passed 0.

    [[[ 0.          0.        ]
      [-0.04339869 -0.04900574]
      [ 0.          0.        ]]
    
     [[ 0.          0.        ]
      [ 0.          0.        ]
      [-0.04339869 -0.04900574]]]
    (2, 3, 2)
    Model: "model_19"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    input_22 (InputLayer)        (None, 3)                 0         
    _________________________________________________________________
    embedding_34 (Embedding)     (None, 3, 2)              2         
    =================================================================
    Total params: 2
    Trainable params: 2
    Non-trainable params: 0
    

    In short, keras maps any out-of-vocabulary word index to a zero vector, which is reasonable: for those positions the forward pass contributes nothing (though the biases may play a role). It is a little counter-intuitive, since passing out-of-vocabulary tokens to the model looks like overhead (rather than just removing them in the pre-processing step) and bad practice, but it is a handy trick for testing different input_dim values without re-computing the tokens.
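
    To answer the original question directly, the safer route is the pre-processing one. A minimal sketch using the keras Tokenizer (num_words and oov_token are standard parameters; the corpus here is illustrative):

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.layers import Embedding
    
    texts = ["the cat sat on the mat", "the dog ate my homework"]  # toy corpus
    
    # keep only the 25000 most frequent words; anything rarer is
    # mapped to the '<unk>' token instead of an out-of-range index
    tok = Tokenizer(num_words=25000, oov_token="<unk>")
    tok.fit_on_texts(texts)
    sequences = tok.texts_to_sequences(texts)  # every index is now < 25000
    
    # input_dim must cover every index the tokenizer can emit,
    # including 0 (reserved for padding) and 1 (the '<unk>' token)
    emb = Embedding(input_dim=25000, output_dim=128, mask_zero=True)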