Tags: python, tensorflow, word-embedding

Tensorflow Embedding Layer Vocabulary Size


I am learning TensorFlow and have come across the Embedding layer, which is used to learn your own word embeddings. The layer takes the following parameters:

keras.layers.Embedding(input_dim, 
                       output_dim, 
                       embeddings_initializer='uniform',
                       embeddings_regularizer=None, 
                       activity_regularizer=None, 
                       embeddings_constraint=None, 
                       mask_zero=False, 
                       input_length=None)

The input_dim should be the same size as the vocabulary, i.e. the number of unique words. If I wanted to limit the vocabulary to the 25000 most frequent words, how should I do this?

Can I simply change input_dim to 25000, or would I have to go through my corpus and replace any word outside the top 25000 with an <unk> token, for example?
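
For context, input_dim just sets the number of rows in the layer's weight matrix: one trainable vector per token id. A quick check in tensorflow.keras (the sizes here are illustrative):

from tensorflow.keras.layers import Embedding
import numpy as np

layer = Embedding(input_dim=25000, output_dim=128)
out = layer(np.array([[0, 1, 24999]]))  # valid token ids run from 0 to input_dim - 1
print(out.shape)                        # (1, 3, 128)
print(layer.get_weights()[0].shape)     # (25000, 128): one row per token id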


Solution

  • Actually, if you use tensorflow.keras, you have to make sure the token indices in your corpus don't exceed the vocabulary size (the input_dim of the embedding layer); otherwise you'll get an error.

    If you use keras, then you can just change the input_dim in your embedding layer without changing anything in the corpus or the tokens: keras will map out-of-vocabulary tokens to a zero vector.

    First of all, there is an error if you use tensorflow.keras.

    tensorflow.keras

    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import Embedding, Input
    import numpy as np
    
    ip = Input(shape=(3,))
    # input_dim=1, so the only valid token index is 0
    emb = Embedding(1, 2, trainable=True, mask_zero=True)(ip)
    
    model = Model(ip, emb)
    input_array = np.array([[5, 3, 1], [1, 2, 3]])  # every index is out of vocabulary
    
    model.compile("rmsprop", "mse")
    
    output_array = model.predict(input_array)  # tensorflow.keras raises an error here
    
    print(output_array)
    print(output_array.shape)
    model.summary()
    

    [Screenshot: model.predict fails with an error for the out-of-vocabulary indices]

    But if I use keras 2.3.1, I don't get any error.

    keras 2.3.1

    from keras.models import Model
    from keras.layers import Embedding, Input
    import numpy as np
    
    ip = Input(shape=(3,))
    # same layer as before: only token index 0 is in vocabulary
    emb = Embedding(1, 2, trainable=True, mask_zero=True)(ip)
    
    model = Model(ip, emb)
    input_array = np.array([[5, 3, 1], [1, 2, 3]])  # same out-of-vocabulary indices
    
    model.compile("rmsprop", "mse")
    
    output_array = model.predict(input_array)  # no error with keras 2.3.1
    
    print(output_array)
    print(output_array.shape)
    model.summary()
    

    [Screenshot: the same model runs without error and prints all-zero vectors]

    keras and tensorflow.keras have different implementations of the embedding layer. To verify that, let's look at the keras embedding layer:

    https://github.com/keras-team/keras/blob/master/keras/layers/embeddings.py#L16

    For now, let's just look at the call function.

        def call(self, inputs):
            if K.dtype(inputs) != 'int32':
                inputs = K.cast(inputs, 'int32')
            out = K.gather(self.embeddings, inputs)
            return out
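
    With the TensorFlow backend, K.gather boils down to tf.gather. Per the tf.gather documentation, an out-of-range index raises an error on CPU but writes zeros into the corresponding output on GPU, which is consistent with the zero vectors we'll see below. A minimal sketch of the lookup itself (table values are illustrative):

    import tensorflow as tf
    
    embeddings = tf.constant([[0.0368, -0.0490]])  # a 1 x 2 table: only row 0 exists
    indices = tf.constant([[0, 0, 0]])             # in-range lookups work on any device
    print(tf.gather(embeddings, indices).shape)    # (1, 3, 2)
    # an out-of-range index (e.g. 5) errors on CPU but yields a zero row on GPU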
    

    N.B.: If you want the exact source code for keras 2.3.1, download it from the releases page: https://github.com/keras-team/keras/releases

    But if we look at the tensorflow implementation, it's different.

    https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/embedding_ops.py

    Just to verify: the call function is written differently.

      def call(self, inputs):
        dtype = K.dtype(inputs)
        if dtype != 'int32' and dtype != 'int64':
          inputs = math_ops.cast(inputs, 'int32')
        out = embedding_ops.embedding_lookup(self.embeddings, inputs)
        return out
    

    Let's run the exact same keras 2.3.1 network as above and observe the output and the weight matrix.


    The model gives the following output.

    [[[0. 0.]
      [0. 0.]
      [0. 0.]]
    
     [[0. 0.]
      [0. 0.]
      [0. 0.]]]
    (2, 3, 2)
    Model: "model_18"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    input_21 (InputLayer)        (None, 3)                 0         
    _________________________________________________________________
    embedding_33 (Embedding)     (None, 3, 2)              2         
    =================================================================
    Total params: 2
    Trainable params: 2
    Non-trainable params: 0
    

    Okay, we are getting a bunch of zeros, but the default embeddings_initializer is 'uniform', not zeros!

    So, let's observe the weight matrix now.

    w = model.layers[1].get_weights()  # the embedding matrix: shape (input_dim, output_dim) = (1, 2)
    print(w)
    
    
    [array([[ 0.03680499, -0.04904002]], dtype=float32)]
    

    In fact, it is not all zeros.

    So, why are we getting zeros?

    Let's change our input to the model.

    Since the only in-vocabulary index for input_dim = 1 is 0, let's pass 0 as one of the inputs.

    from keras.models import Model
    from keras.layers import Embedding, Input
    import numpy as np
    
    ip = Input(shape=(3,))
    emb = Embedding(1, 2, trainable=True, mask_zero=True)(ip)
    
    model = Model(ip, emb)
    input_array = np.array([[5, 0, 1], [1, 2, 0]])  # each row now contains the in-vocabulary index 0
    
    model.compile("rmsprop", "mse")
    
    output_array = model.predict(input_array)
    print(output_array)
    print(output_array.shape)
    model.summary()
    

    Now, we get non-zero vectors for the positions where we passed 0.

    [[[ 0.          0.        ]
      [-0.04339869 -0.04900574]
      [ 0.          0.        ]]
    
     [[ 0.          0.        ]
      [ 0.          0.        ]
      [-0.04339869 -0.04900574]]]
    (2, 3, 2)
    Model: "model_19"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    input_22 (InputLayer)        (None, 3)                 0         
    _________________________________________________________________
    embedding_34 (Embedding)     (None, 3, 2)              2         
    =================================================================
    Total params: 2
    Trainable params: 2
    Non-trainable params: 0
    

    In short, keras maps any out-of-vocabulary word index to a zero vector, which is reasonable: for those positions the forward pass contributes nothing (though the biases may play a role). It is a little counter-intuitive, since passing out-of-vocabulary tokens to the model looks like overhead (rather than just removing them in the pre-processing step) and bad practice, but it is a handy trick for testing different input_dim values without re-computing the tokens.
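
    To answer the original question directly, the safer route is the pre-processing one. A minimal sketch using the keras Tokenizer (num_words and oov_token are standard parameters; the corpus here is illustrative):

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.layers import Embedding
    
    texts = ["the cat sat on the mat", "the dog ate my homework"]  # toy corpus
    
    # keep only the 25000 most frequent words; anything rarer is
    # mapped to the '<unk>' token instead of an out-of-range index
    tok = Tokenizer(num_words=25000, oov_token="<unk>")
    tok.fit_on_texts(texts)
    sequences = tok.texts_to_sequences(texts)  # every index is now < 25000
    
    # input_dim must cover every index the tokenizer can emit,
    # including 0 (reserved for padding) and 1 (the '<unk>' token)
    emb = Embedding(input_dim=25000, output_dim=128, mask_zero=True)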