
How can I load a Pandas DataFrame into a LSTM model?


I'm experimenting with RNNs and am having trouble getting my data into the right format for my model. I have the following dataframe:

   Apple  Pears  Oranges  ID
0   1.00   2.09     4.11   0
1   1.38   1.73     5.13   1
2   1.68   2.28     6.91   2
3   1.50   2.69     8.93   3
4   1.35   2.63    12.25   4
5   1.52   3.09    12.20   5
6   1.63   3.63    13.68   6
7   2.01   4.92    16.21   7
8   2.52   4.01    18.79   8
9   3.10   5.49    24.05   9

ID gives the order (timestep) of each row.

I ran this command to try to load it into a timeseries dataset:

Dataset = keras.preprocessing.timeseries_dataset_from_array(priceHistorydf, basketHistorydf, sequence_length=10)

But when I try to train the following model on this dataset, I get an error:

from tensorflow import keras
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import *

X_train = priceHistorydf
y_train = basketHistorydf

model = Sequential()
model.add(TimeDistributed(Dense(10), input_shape=(X_train.shape[1:])))
model.add(Bidirectional(LSTM(8)))

model.add(Dense(8, activation='tanh'))
model.add(Dense(8, activation='tanh'))
model.add(Dense(y_train.shape[-1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer="adam")

# history = model.fit(X_train, y_train, epochs=2, batch_size=8)
history = model.fit(Dataset, epochs=2, batch_size=8)

Error:

 ValueError: `TimeDistributed` Layer should be passed an `input_shape ` with at least 3 dimensions, received: [None, 4]

My guess is that I never explicitly told the model that ID is the timestep, but I'm unsure how to pass that to the model with my dataframe.

Any suggestions?


Solution

  • The main problem is that the input_shape argument is set incorrectly: X_train is the original data, not the generated timeseries, so X_train.shape[1:] is not the right input shape. Since you used sequence_length=10 and each timestep has 3 features, the shape should be input_shape=(10, 3) (assuming you first remove the ID column from the data, because it is not a feature per se).

    As a side note: Dense(...) and TimeDistributed(Dense(...)) are equivalent here, because the Dense layer is applied on the last axis by default. See this answer for more information and explanation.
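A minimal sketch of the fix described above, assuming the question's ten rows as the feature data. The contents of basketHistorydf are not shown in the question, so the one-hot `targets` array below is a hypothetical stand-in, and the names `price_df`, `features`, and `targets` are made up for illustration. The ID column is used only to order the rows and is then dropped, leaving 3 features per timestep.

```python
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

# The question's data. ID only orders the rows, so it is not a feature.
price_df = pd.DataFrame({
    "Apple":   [1.00, 1.38, 1.68, 1.50, 1.35, 1.52, 1.63, 2.01, 2.52, 3.10],
    "Pears":   [2.09, 1.73, 2.28, 2.69, 2.63, 3.09, 3.63, 4.92, 4.01, 5.49],
    "Oranges": [4.11, 5.13, 6.91, 8.93, 12.25, 12.20, 13.68, 16.21, 18.79, 24.05],
    "ID":      list(range(10)),
})
features = price_df.sort_values("ID").drop(columns="ID").to_numpy(dtype="float32")

# Hypothetical one-hot targets (2 classes, one row per timestep).
# This stands in for basketHistorydf, whose contents are not in the question.
targets = np.eye(2, dtype="float32")[np.arange(len(features)) % 2]

sequence_length = 10
dataset = keras.preprocessing.timeseries_dataset_from_array(
    features, targets, sequence_length=sequence_length, batch_size=8
)

model = keras.Sequential([
    # Shape of one window: (timesteps, features) = (10, 3).
    keras.Input(shape=(sequence_length, features.shape[1])),
    # Plain Dense acts on the last axis, so TimeDistributed is not needed.
    layers.Dense(10),
    layers.Bidirectional(layers.LSTM(8)),
    layers.Dense(8, activation="tanh"),
    layers.Dense(targets.shape[-1], activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")

# Inspect one batch to confirm the shapes the model will receive.
x_batch, y_batch = next(iter(dataset))
print(x_batch.shape, y_batch.shape)

# The dataset is already batched, so batch_size is not passed to fit().
history = model.fit(dataset, epochs=2, verbose=0)
```

With only 10 rows and sequence_length=10 the dataset contains a single window, so this checks the shapes rather than training anything useful. A shorter sequence_length or more rows would yield more windows, and the Input shape would change to match.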