Tags: tensorflow, keras, deep-learning, nlp, word-embedding

Is word embedding in Keras also a dimensionality reduction technique?


I want to understand the purpose of embedding_dim versus using a one-hot vector of the entire vocab_size. Is it a dimensionality reduction of the one-hot vector, from vocab_size dimensions down to embedding_dim dimensions, or is there some other utility, intuitively? Also, how should one decide the embedding_dim value?

Code -

    import tensorflow as tf

    vocab_size = 10000
    embedding_dim = 16
    max_length = 120

    # Simple binary classifier over padded sequences of token ids.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(6, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()

Output -

    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    embedding (Embedding)        (None, 120, 16)           160000
    _________________________________________________________________
    flatten (Flatten)            (None, 1920)              0
    _________________________________________________________________
    dense (Dense)                (None, 6)                 11526
    _________________________________________________________________
    dense_1 (Dense)              (None, 1)                 7
    =================================================================
    Total params: 171,533
    Trainable params: 171,533
    Non-trainable params: 0
    _________________________________________________________________

Solution

  • When you have a small number of categorical features and little training data, one-hot encoding is usually the better choice. When you have a lot of training data and a large number of categorical features, embeddings are the better choice.

    Why were embeddings developed?
    If you one-hot encode a large number of categorical features, you end up with a huge sparse matrix in which most elements are zero. That is not well suited to training ML models: the data suffers from the curse of dimensionality. With embeddings, you can represent the same categories in a much smaller number of dimensions, and the output is a dense vector rather than a sparse one.
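    As a minimal sketch of this "reduction" (using an untrained layer and an arbitrary token id purely for illustration), looking up a row of the embedding matrix gives the same result as multiplying a one-hot vector by that matrix, i.e. a learned linear projection from vocab_size down to embedding_dim:

        import numpy as np
        import tensorflow as tf

        vocab_size = 10000
        embedding_dim = 16
        token_id = 42                                  # arbitrary example token

        # Untrained Embedding layer with randomly initialized weights.
        emb = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        looked_up = emb(tf.constant([token_id]))[0]    # dense vector, shape (16,)

        W = emb.get_weights()[0]                       # embedding matrix, shape (10000, 16)
        one_hot = tf.one_hot(token_id, vocab_size)     # one-hot vector, shape (10000,)

        # The lookup equals one_hot @ W: a learned linear projection
        # from vocab_size (10000) dimensions down to embedding_dim (16).
        projected = tf.linalg.matvec(W, one_hot, transpose_a=True)
        print(np.allclose(looked_up.numpy(), projected.numpy()))  # True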

    Drawbacks of embeddings:

    • Requires time to train
    • Requires a large amount of training data

    Advantage

    • Embeddings capture the semantics of items: related items end up close together in the embedding space. One-hot encoding cannot do this; it just represents each item as its own orthogonal axis in a vocab_size-dimensional space.
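    A rough sketch of that contrast (the layer below is untrained, so its similarity value is just noise; after training on real text, related words tend to score much higher than unrelated ones): the cosine similarity between any two distinct one-hot vectors is always exactly zero, while dense embedding vectors can land arbitrarily close together.

        import numpy as np
        import tensorflow as tf

        def cosine(a, b):
            a, b = np.asarray(a, np.float32), np.asarray(b, np.float32)
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        vocab_size = 10000

        # Two different one-hot vectors are always orthogonal: similarity 0,
        # so one-hot encoding carries no notion of relatedness.
        print(cosine(tf.one_hot(3, vocab_size), tf.one_hot(7, vocab_size)))   # 0.0

        # Embedding vectors live in a small dense space where related tokens
        # can end up close together once the layer has been trained.
        emb = tf.keras.layers.Embedding(vocab_size, 16)
        print(cosine(emb(tf.constant([3]))[0], emb(tf.constant([7]))[0]))     # random value here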

    What size should the embedding vector be?

        embedding_dimensions = vocab_size ** 0.25


    Note: this is just a rule of thumb. You can choose an embedding dimension smaller or larger than this. The quality of word embeddings tends to increase with dimensionality, but beyond some point the marginal gain diminishes.
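    As a quick worked example of that rule of thumb, the vocabulary in the question gives roughly 10 dimensions, so the 16 used in the model above is in the same ballpark:

        vocab_size = 10000
        embedding_dim = round(vocab_size ** 0.25)   # fourth root of the vocabulary size
        print(embedding_dim)                        # 10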