pandas numpy tensorflow lstm tensorflow2.0

Reshape the tabular time series data for LSTM binary classification model


I want to prepare data for an LSTM binary classification model, which means reshaping it into the shape (num_samples, time_steps, num_features). My training data set has the shape (2487576, 21). Here is my code, applied to a toy data set.

import pandas as pd
import numpy as np
url="https://gist.githubusercontent.com/JishanAhmed2019/7381979ecafb7efd456421c324d7963a/raw/a50a653119471cd4fe323d7680fe82a161727169/test.csv"
df=pd.read_csv(url,sep="\t")
def generate_train_data(X, y, sequence_length=2, step = 1):
    X_local = []
    y_local = []
    for start in range(0, len(df) - sequence_length, step):
        end = start + sequence_length
        X_local.append(X[start:end])
        y_local.append(y[end-1])
    return np.array(X_local), np.array(y_local)

train_X_sequence, train_y = generate_train_data(df.loc[:, "V1":"V2"].values, df.Class)

Output:

train_X_sequence

array([[[ 30, 100],
        [ 40, 200]],

       [[ 40, 200],
        [ 50, 300]],

       [[ 50, 300],
        [ 60, 400]],

       [[ 60, 400],
        [ 70, 500]],

       [[ 70, 500],
        [ 80, 600]],

       [[ 80, 600],
        [ 90, 700]]])
train_y

array([0, 1, 0, 0, 0, 1])

The last row does not show up in the reshaped data. Is there anything I am missing here? I am using the LSTM layer from TensorFlow.


Solution

  • You need to use len(df) - sequence_length + 1 as the stop value of the range in the for loop:

    for start in range(0, len(df) - sequence_length + 1, step):
    

    Simple steps to prove it:

    1. y[end-1] must be able to reach the last row of df['Class'], whose index is len(df)-1, so the maximum value of end should be len(df).
    2. end is equal to start + sequence_length, so the maximum value of start should be len(df) - sequence_length.
    3. range excludes its stop value, so start in the for loop should cover the half-open interval [0, len(df) - sequence_length + 1), where the square bracket means the value is included and the parenthesis means it is excluded.
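
Putting the fix together, here is a minimal sketch of the corrected function. It departs from the question's code in one more place, which is a suggestion rather than part of the answer above: the loop bound uses len(X) instead of the global df, so the function depends only on its arguments. The toy arrays X_toy and y_toy are made up for illustration and are not the contents of the CSV in the question.

```python
import numpy as np


def generate_train_data(X, y, sequence_length=2, step=1):
    """Slice X into overlapping windows; each window is labelled with
    the y value of its last row."""
    X = np.asarray(X)
    y = np.asarray(y)
    X_local, y_local = [], []
    # "+ 1" because range() excludes its stop value: the largest valid
    # start is len(X) - sequence_length, which makes end == len(X).
    for start in range(0, len(X) - sequence_length + 1, step):
        end = start + sequence_length
        X_local.append(X[start:end])
        y_local.append(y[end - 1])
    return np.array(X_local), np.array(y_local)


# Hypothetical toy data: 8 rows, 2 features (assumed, not the real CSV).
X_toy = np.column_stack([np.arange(30, 110, 10), np.arange(100, 900, 100)])
y_toy = np.array([0, 0, 1, 0, 0, 0, 1, 1])

X_seq, y_seq = generate_train_data(X_toy, y_toy)
print(X_seq.shape)  # (num_samples, time_steps, num_features)
print(X_seq[-1])    # the final window now ends on the last row
print(y_seq)
```

With step = 1, the number of windows is len(X) - sequence_length + 1, so 8 rows and a window of 2 give 7 samples, and the last window contains the last row of the input.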