Let's say we have a DataFrame with a categorical column that has 7 categories: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday. Let's say we have 100 data points, and we want to feed the categorical data into an embedding layer and train that embedding layer using Keras. How do we actually achieve this? Can you share some intuition, with code examples?
I have tried the code below, but it gives me the error "ValueError: "input_length" is 1, but received input has shape (None, 26)". I have referred to this blog post https://medium.com/@satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9, but I couldn't work out how to apply it to my particular case.
import numpy as np
from sklearn.preprocessing import LabelEncoder
from keras.layers import Input, Embedding, Flatten

# Integer-encode the categorical column
l_encoder = LabelEncoder()
l_encoder.fit(X_train["Weekdays"])
encoded_weekdays_train = l_encoder.transform(X_train["Weekdays"])
encoded_weekdays_test = l_encoder.transform(X_test["Weekdays"])

# Heuristic: embedding size is half the number of categories, capped at 50
no_of_unique_cat = len(X_train.school_state.unique())
embedding_size = int(min(np.ceil(no_of_unique_cat / 2), 50))
vocab = no_of_unique_cat + 1

#Get the flattened LSTM output for categorical text
input_layer2 = Input(shape=(embedding_size,))
embedding = Embedding(input_dim=vocab, output_dim=embedding_size, input_length=1, trainable=True)(input_layer2)
flatten_school_state = Flatten()(embedding)
In the case of 7 categories, what should the shape of input_layer2 be? What should the vocab size, output_dim, and input_length be? Can anyone explain, or correct my code? Your insights would be really helpful.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-e28d41acae85> in <module>
      1 #Get the flattened LSTM output for input text
      2 input_layer2 = Input(shape=(embedding_size,))
----> 3 embedding = Embedding(input_dim=vocab, output_dim=embedding_size, input_length=1, trainable=True)(input_layer2)
      4 flatten_school_state = Flatten()(embedding)

~/anaconda3/lib/python3.7/site-packages/keras/engine/base_layer.py in __call__(self, inputs, **kwargs)
    472             if all([s is not None
    473                     for s in to_list(input_shape)]):
--> 474                 output_shape = self.compute_output_shape(input_shape)
    475             else:
    476                 if isinstance(input_shape, list):

~/anaconda3/lib/python3.7/site-packages/keras/layers/embeddings.py in compute_output_shape(self, input_shape)
    131                         raise ValueError(
    132                             '"input_length" is %s, but received input has shape %s' %
--> 133                             (str(self.input_length), str(input_shape)))
    134                     elif s1 is None:
    135                         in_lens[i] = s2

ValueError: "input_length" is 1, but received input has shape (None, 26)
embedding_size can never be the input size. A Keras embedding takes integers as input: you should have your data as numbers from 0 to 6.
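For instance, a minimal sketch of what that integer encoding looks like (the toy data below is invented for illustration; only the LabelEncoder usage mirrors the question):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

all_days = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
sample = pd.Series(["Monday", "Sunday", "Friday"])

l_encoder = LabelEncoder()
l_encoder.fit(all_days)             # 7 classes mapped to integers 0..6
print(l_encoder.transform(sample))  # [1 3 0] -- codes follow alphabetical order
print(l_encoder.classes_)           # the 7 labels, sorted alphabetically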
If your 100 data points form a sequence of days, you cannot restrict the length of the sequences in the embedding to 1. Your input shape should be (length_of_sequence,), which means your training data should have shape (any, length_of_sequence), probably (1, 100) given your description.
All the rest is automatic.
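To make that concrete, here is a minimal sketch of the sequence version. The variable names, the dummy Dense head, and the random data are assumptions added so the example runs end to end; they are not from the original post:

import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense

vocab = 7 + 1               # 7 weekday codes plus one spare index
length_of_sequence = 100    # the 100 data points read as one sequence
embedding_size = 4          # min(ceil(7 / 2), 50), the question's heuristic

# The input holds integer day codes in 0..6, one per time step
input_layer = Input(shape=(length_of_sequence,))
embedding = Embedding(input_dim=vocab,
                      output_dim=embedding_size,
                      input_length=length_of_sequence,
                      trainable=True)(input_layer)
flat = Flatten()(embedding)
output = Dense(1, activation="sigmoid")(flat)  # dummy head, just so we can train

model = Model(inputs=input_layer, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Training data has shape (any, length_of_sequence), here (1, 100)
X = np.random.randint(0, 7, size=(1, length_of_sequence))
y = np.array([1.0])
model.fit(X, y, epochs=1)

The learned embedding weight matrix then has shape (vocab, embedding_size), i.e. one 4-dimensional vector per weekday code, and it is trained together with the rest of the model.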