I want to understand the purpose of embedding_dim compared with using a one-hot vector of size vocab_size. Is it a dimensionality reduction of the one-hot vector from vocab_size dimensions down to embedding_dim dimensions, or does it serve some other purpose, intuitively? Also, how should one decide on embedding_dim?
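Code:
import tensorflow as tf

vocab_size = 10000
embedding_dim = 16
max_length = 120
model = tf.keras.Sequential([
    # Maps each of the 120 token ids to a trainable 16-dimensional vector
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # Flattens (120, 16) into 1920 values per example
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()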
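Output:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 120, 16)           160000
flatten (Flatten)            (None, 1920)              0
dense (Dense)                (None, 6)                 11526
dense_1 (Dense)              (None, 1)                 7
=================================================================
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0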
When you have a small number of categorical values and little training data, one-hot encoding is usually the better choice. When you have a lot of training data and a large number of categorical values (for example, the words in a vocabulary), embeddings are usually the better choice.
Why were Embeddings developed?
If you one-hot encode a large number of categorical values, you end up with a huge sparse matrix in which most of the elements are zero. This is poorly suited to training ML models, because your data will suffer from the curse of dimensionality. With embeddings, you can represent the same categories in far fewer dimensions, so in that sense it is a dimensionality reduction from vocab_size to embedding_dim. In addition, the output is a dense vector rather than a sparse one, and its values are learned during training.
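To make the relationship with one-hot vectors concrete, here is a minimal sketch in plain Python (no TensorFlow). It assumes toy sizes and a randomly initialised matrix, and the helper names one_hot and vector_times_matrix are made up for illustration. It shows that multiplying a one-hot vector by the embedding matrix gives the same result as simply reading one row of that matrix, which is the lookup an embedding layer performs.

```python
import random

# Toy sizes chosen for illustration only (the question uses 10000 and 16).
vocab_size = 10
embedding_dim = 4

random.seed(0)
# Embedding matrix: one row of embedding_dim weights per vocabulary entry.
# In a real model these weights are learned; here they are random.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def one_hot(index, size):
    """Hypothetical helper: vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vec, mat):
    """Hypothetical helper: product of a row vector and a matrix."""
    n_cols = len(mat[0])
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(n_cols)]


token_id = 7

# Route 1: build the sparse one-hot vector and multiply (vocab_size * embedding_dim work).
via_one_hot = vector_times_matrix(one_hot(token_id, vocab_size), embedding_matrix)

# Route 2: read the row directly (what an embedding lookup does).
via_lookup = embedding_matrix[token_id]

same = all(abs(a - b) < 1e-12 for a, b in zip(via_one_hot, via_lookup))
print("one-hot product equals row lookup:", same)
```

So the embedding can be read as a one-hot vector followed by a linear layer without bias, implemented as a row lookup so that the sparse vector never has to be built.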
Drawbacks of embeddings:
You have to decide what size to select for the embedding vector. One rule of thumb is:
embedding_dimensions = vocab_size ** 0.25
Note: This is just a rule of thumb. You can select an embedding dimension smaller or greater than this. The quality of word embeddings tends to increase with higher dimensionality, but beyond some point the marginal gain diminishes.
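As a rough worked example, the sketch below applies the rule of thumb to the vocab_size from the question and counts the parameters of the question's model for a given embedding size. The function names are hypothetical, and rounding to the nearest integer is an assumption, since the rule itself does not say how to round.

```python
def rule_of_thumb_embedding_dim(vocab_size):
    """Fourth root of the vocabulary size, rounded (rounding is an assumption)."""
    return round(vocab_size ** 0.25)


def total_params(vocab_size, embedding_dim, max_length, hidden_units=6):
    """Parameter count for the question's model:
    Embedding -> Flatten -> Dense(hidden_units) -> Dense(1)."""
    embedding = vocab_size * embedding_dim            # one vector per token id
    flattened = max_length * embedding_dim            # Flatten has no parameters
    dense = flattened * hidden_units + hidden_units   # weights + biases
    output = hidden_units * 1 + 1                     # weights + bias
    return embedding + dense + output


vocab_size = 10000
max_length = 120

suggested_dim = rule_of_thumb_embedding_dim(vocab_size)
print("suggested embedding_dim:", suggested_dim)  # 10

# With embedding_dim = 16 this reproduces the total in the model summary.
print("params with dim 16:", total_params(vocab_size, 16, max_length))  # 171533
print("params with suggested dim:", total_params(vocab_size, suggested_dim, max_length))
```

For this vocabulary the rule suggests about 10 dimensions, so the 16 used in the question is in the same range. In practice it is reasonable to treat embedding_dim as a hyperparameter and compare a few values on validation data.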