I want to prepare data for an LSTM binary classification model, so I need to reshape it into the shape (num_samples, time_steps, num_features). My training data set has the shape (2487576, 21). Here is my code for a toy data set:
import pandas as pd
import numpy as np

url = "https://gist.githubusercontent.com/JishanAhmed2019/7381979ecafb7efd456421c324d7963a/raw/a50a653119471cd4fe323d7680fe82a161727169/test.csv"
df = pd.read_csv(url, sep="\t")

def generate_train_data(X, y, sequence_length=2, step=1):
    X_local = []
    y_local = []
    for start in range(0, len(df) - sequence_length, step):
        end = start + sequence_length
        X_local.append(X[start:end])
        y_local.append(y[end-1])
    return np.array(X_local), np.array(y_local)

train_X_sequence, train_y = generate_train_data(df.loc[:, "V1":"V2"].values, df.Class)
Output:
train_X_sequence
array([
[[ 30, 100],
[ 40, 200]],
[[ 40, 200],
[ 50, 300]],
[[ 50, 300],
[ 60, 400]],
[[ 60, 400],
[ 70, 500]],
[[ 70, 500],
[ 80, 600]],
[[ 80, 600],
[ 90, 700]]])
train_y
array([0, 1, 0, 0, 0, 1])
The last row does not show up in the reshaped data. Is there anything I am missing here? I am using LSTM from the TensorFlow framework.
You need to use len(df) - sequence_length + 1 as the stop value in the for loop:
for start in range(0, len(df) - sequence_length + 1, step):
Simple steps to prove it:
y[end-1] must be able to reach the last row of df['Class'], whose index is len(df)-1, so max(end) should be len(df).
end is equal to start + sequence_length, so max(start) should be len(df) - sequence_length.
Because range excludes its stop value, start in the for loop should cover the half-open interval [0, len(df) - sequence_length + 1), where the square bracket means the value is included and the round bracket means the value is excluded.
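To check the reasoning above, here is a minimal sketch of the corrected function. It runs on a small hypothetical array rather than the original CSV: the 8 rows of X and the labels in y are made up for illustration. It also uses len(X) instead of the global df, which is an assumption about what was intended so that the function depends only on its arguments.

```python
import numpy as np

def generate_train_data(X, y, sequence_length=2, step=1):
    """Slice X into overlapping windows; label each window with the y of its last row."""
    X_local, y_local = [], []
    # "+ 1" so that the final start index, len(X) - sequence_length, is included
    for start in range(0, len(X) - sequence_length + 1, step):
        end = start + sequence_length
        X_local.append(X[start:end])
        y_local.append(y[end - 1])
    return np.array(X_local), np.array(y_local)

# Hypothetical toy data (8 rows, 2 features), not the data from the question
X = np.column_stack([np.arange(30, 110, 10), np.arange(100, 900, 100)])
y = np.array([1, 0, 1, 0, 0, 0, 1, 1])

train_X_sequence, train_y = generate_train_data(X, y)
print(train_X_sequence.shape)  # (7, 2, 2)
print(train_X_sequence[-1])    # the last window now ends on the last row
```

With 8 rows and sequence_length=2 this yields 7 windows, one more than before. In general the number of windows is (len(X) - sequence_length) // step + 1.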