
Keras: Categorical vs Continuous input to a LSTM


I am new to Keras and deep learning, and after going through several tutorials and answers on Stack Overflow, I am still unclear about how the input is manipulated once it enters the network.

I am using the functional API of Keras to develop complex models, so my first layer is always an Input layer. Something like:

Input()

LSTM()

Dense()
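The stack above can be sketched with the functional API roughly as follows; the layer sizes and the single-feature input shape are illustrative assumptions, not values from the question:

```python
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# Functional API: each layer is called on the previous layer's output.
inputs = Input(shape=(6000, 1))   # (timesteps, features per step)
x = LSTM(64)(inputs)              # returns the final hidden state, shape (None, 64)
outputs = Dense(1)(x)             # e.g. a single regression output
model = Model(inputs, outputs)
```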

Now let's say I have 2 training datasets, A and B. Each dataset is an identical 10,000-by-6,000 matrix with 200 distinct values in it, i.e. 10,000 rows, each representing a training example, with 6,000 time steps per sequence. The values in both look like [[3,50,1,22,7,5,3,1,5,..], [55,32,54,21,15, ...], .... ]. The only difference between A and B is that the values in A are real numbers (continuous variables), while the values in B are discrete (categorical variables).

I have the following 3 possible options for differentiating between categorical and continuous input, and I wanted to ask which of these will work, and which are better than the others.

1- Given A is real valued and B is categorical, convert both A and B with .astype(float), feed them to the network, and let the network treat them accordingly.

2- Given B has categorical values, convert B to a one-hot encoding, i.e. changing 10,000 by 6,000 to 10,000 by 6,000 by 200. Keep A as it is.

3- If we are using B, then add an embedding layer after the input, making the network:

Input()

Embedding()

LSTM()

Dense()

If we are using A, then don't add the embedding layer.
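Option 2 above can be sketched with plain NumPy; the tiny shapes here are stand-ins for the 10,000-by-6,000 matrix, but the 200 distinct values match the question:

```python
import numpy as np

num_classes = 200
# Small stand-in for categorical dataset B: (examples, timesteps)
B = np.random.randint(0, num_classes, size=(4, 10))

# Index into an identity matrix to one-hot encode:
# (examples, timesteps) -> (examples, timesteps, num_classes)
B_onehot = np.eye(num_classes)[B]
```

Scaled up, this is exactly the 10,000 x 6,000 to 10,000 x 6,000 x 200 change described in option 2.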


Solution

  • It seems the categorical input is confusing you. To embed or not to embed:

    1. We embed categorical input using an Embedding layer for two reasons: to reduce the dimensionality of the space and to capture similarities between inputs. So when you have billions of words in a language, it makes sense to embed them into, say, 300-dimensional vectors to keep things manageable. But one-hot encoding always gives the most distinction, and in your case 200 is not a large number per se, so one-hot is the way to go.
    2. For the continuous input, we normalise, often with a simple min-max normalisation, so the max becomes 1 and the min becomes 0. But there are many ways of doing it, depending on the nature of your dataset.
    3. For the actual model, you can have 2 inputs that process the continuous and categorical data differently and perhaps share layers downstream; otherwise, creating 2 different models might make sense.

    You can find more information online that covers input encoding.
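The min-max normalisation mentioned in point 2 can be written in a line of NumPy; the sample values are made up for illustration, and this assumes the dataset is not constant (max != min):

```python
import numpy as np

# Stand-in for a slice of continuous dataset A
A = np.array([[3.0, 50.0, 1.0, 22.0],
              [55.0, 32.0, 54.0, 21.0]])

# Min-max normalisation: scale every value into [0, 1]
A_norm = (A - A.min()) / (A.max() - A.min())
```

In practice you would usually compute the min and max on the training set only and reuse them for validation and test data, to avoid leaking information.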