Tags: keras, nlp, lstm, categorical-data, word-embedding

I want to know how we can feed a categorical variable as input to an embedding layer in Keras and train that embedding layer.


Let's say we have a data frame with a categorical column that has 7 categories: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday, and we have 100 data points. We want to give this categorical data as input to an embedding layer and train the embedding layer using Keras. How do we actually achieve it? Can you share some intuition with code examples?

I have tried the code below, but it gives me the error "ValueError: "input_length" is 1, but received input has shape (None, 26)". I have referred to this blog https://medium.com/@satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9, but I couldn't figure out how to apply it to my particular case.

import numpy as np
from sklearn.preprocessing import LabelEncoder
from keras.layers import Input, Embedding, Flatten

# Integer-encode the categorical column
l_encoder = LabelEncoder()
l_encoder.fit(X_train["Weekdays"])

encoded_weekdays_train = l_encoder.transform(X_train["Weekdays"])
encoded_weekdays_test = l_encoder.transform(X_test["Weekdays"])

no_of_unique_cat = len(X_train.school_state.unique())
embedding_size = min(np.ceil(no_of_unique_cat / 2), 50)
embedding_size = int(embedding_size)
vocab = no_of_unique_cat + 1

# Get the flattened embedding output for the categorical column
input_layer2 = Input(shape=(embedding_size,))
embedding = Embedding(input_dim=vocab, output_dim=embedding_size, input_length=1, trainable=True)(input_layer2)
flatten_school_state = Flatten()(embedding)

In the case of 7 categories, what should the shape of input_layer2 be? What should the vocab size, output_dim, and input_length be? Can anyone explain or correct my code? Your insights would be really helpful.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-e28d41acae85> in <module>
      1 #Get the flattened LSTM output for input text
      2 input_layer2 = Input(shape=(embedding_size,))
----> 3 embedding = Embedding(input_dim=vocab, output_dim=embedding_size, input_length=1, trainable=True)(input_layer2)
      4 flatten_school_state = Flatten()(embedding)

~/anaconda3/lib/python3.7/site-packages/keras/engine/base_layer.py in __call__(self, inputs, **kwargs)
    472             if all([s is not None
    473                     for s in to_list(input_shape)]):
--> 474                 output_shape = self.compute_output_shape(input_shape)
    475             else:
    476                 if isinstance(input_shape, list):

~/anaconda3/lib/python3.7/site-packages/keras/layers/embeddings.py in compute_output_shape(self, input_shape)
    131                         raise ValueError(
    132                             '"input_length" is %s, but received input has shape %s' %
--> 133                             (str(self.input_length), str(input_shape)))
    134                     elif s1 is None:
    135                         in_lens[i] = s2

ValueError: "input_length" is 1, but received input has shape (None, 26)



Solution

  • embedding_size can never be the input size.

    A Keras embedding takes "integers" as input. You should have your data as numbers from 0 to 6.

    If your 100 data points form a sequence of days, you cannot restrict the length of the sequences in the embedding to 1.

    Your input shape should be (length_of_sequence,), which means your training data should have shape (any, length_of_sequence). By your description, that is probably (1, 100).

    All the rest is automatic.
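
    A minimal sketch of the idea, assuming the 100 weekdays form a single sequence and reusing the question's X_train["Weekdays"] column; the Dense head and the binary loss are placeholders just to make the model compile:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from keras.models import Model
    from keras.layers import Input, Embedding, Flatten, Dense

    # Integer-encode the 7 weekday categories as numbers 0..6
    l_encoder = LabelEncoder()
    encoded_weekdays = l_encoder.fit_transform(X_train["Weekdays"])  # shape (100,)

    length_of_sequence = 100   # the 100 data points form one sequence of days
    vocab = 7                  # 7 distinct categories, indices 0..6
    embedding_size = 4         # min(ceil(7 / 2), 50)

    # The input holds integer indices, one per time step
    input_layer = Input(shape=(length_of_sequence,))
    embedding = Embedding(input_dim=vocab,
                          output_dim=embedding_size,
                          input_length=length_of_sequence,
                          trainable=True)(input_layer)          # (None, 100, 4)
    flattened = Flatten()(embedding)                             # (None, 400)
    output = Dense(1, activation="sigmoid")(flattened)           # placeholder head

    model = Model(inputs=input_layer, outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # Training data must have shape (any, length_of_sequence): here (1, 100)
    x = encoded_weekdays.reshape(1, length_of_sequence)

    If instead each row is an independent single day, use Input(shape=(1,)) with input_length=1 and pass the encoded column reshaped to (100, 1); the embedding weights are learned by backpropagation during model.fit either way.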