pandas numpy tensorflow lstm tensorflow2.0

Reshape tabular time series data for an LSTM binary classification model

I want to prepare data for an LSTM binary classification model by reshaping it into the shape (num_samples, time_steps, num_features). My training data set has the shape (2487576, 21). Here is my code, run on toy data.

import pandas as pd
import numpy as np
url="https://gist.githubusercontent.com/JishanAhmed2019/7381979ecafb7efd456421c324d7963a/raw/a50a653119471cd4fe323d7680fe82a161727169/test.csv"
df=pd.read_csv(url,sep="\t")
def generate_train_data(X, y, sequence_length=2, step = 1):
    X_local = []
    y_local = []
    for start in range(0, len(df) - sequence_length, step):
        end = start + sequence_length
        X_local.append(X[start:end])
        y_local.append(y[end-1])
    return np.array(X_local), np.array(y_local)

train_X_sequence, train_y = generate_train_data(df.loc[:, "V1":"V2"].values, df.Class)

Output:

train_X_sequence

array([[[ 30, 100],
        [ 40, 200]],

       [[ 40, 200],
        [ 50, 300]],

       [[ 50, 300],
        [ 60, 400]],

       [[ 60, 400],
        [ 70, 500]],

       [[ 70, 500],
        [ 80, 600]],

       [[ 80, 600],
        [ 90, 700]]])

train_y

array([0, 1, 0, 0, 0, 1])

The last row does not show up in the reshaped data. Is there anything I am missing here? I am using the LSTM layer from TensorFlow.

Solution

You need to use len(df) - sequence_length + 1 as the upper bound of the range in the for loop.

for start in range(0, len(df) - sequence_length + 1, step):
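For context, here is a minimal sketch of the whole function with that bound applied. It makes two assumptions that are not in the question: it uses len(X) instead of the global len(df) so the function depends only on its arguments, and it runs on made-up toy arrays (X_toy, y_toy) rather than the CSV from the question.

```python
import numpy as np

def generate_train_data(X, y, sequence_length=2, step=1):
    """Slide a window of `sequence_length` rows over X and label each
    window with the y value of its last row."""
    X = np.asarray(X)
    y = np.asarray(y)
    X_local, y_local = [], []
    # "+ 1" so the final window, which ends on the last row, is included
    for start in range(0, len(X) - sequence_length + 1, step):
        end = start + sequence_length
        X_local.append(X[start:end])
        y_local.append(y[end - 1])
    return np.array(X_local), np.array(y_local)

# Made-up toy data (an assumption, not the question's CSV): 8 rows, 2 features
X_toy = np.column_stack([np.arange(30, 110, 10), np.arange(100, 900, 100)])
y_toy = np.array([1, 0, 1, 0, 0, 0, 1, 1])

X_seq, y_seq = generate_train_data(X_toy, y_toy)
print(X_seq.shape)  # (num_samples, time_steps, num_features)
print(X_seq[-1])    # the last window now ends on the last row
print(y_seq)
```

With 8 rows and sequence_length=2 this should give 7 windows instead of 6, and the final window contains the last row.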

Simple steps to prove it:

You want y[end-1] to be able to access the last row of df['Class'], whose index is len(df)-1, so max(end) should be len(df).
end is equal to start + sequence_length, so max(start) should be len(df) - sequence_length.
Because range() excludes its stop value, start in the for loop should cover the half-open interval [0, len(df) - sequence_length + 1), where the square bracket means the value is included and the parenthesis means it is excluded.
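The steps above can also be checked numerically. The sketch below uses a hypothetical helper, window_bounds, which is not part of the question's code. It only lists the (start, end) pairs the corrected loop would produce for 8 rows and sequence_length=2 with the default step of 1.

```python
def window_bounds(n_rows, sequence_length, step=1):
    """Return the (start, end) pairs produced by the corrected loop bound."""
    return [(start, start + sequence_length)
            for start in range(0, n_rows - sequence_length + 1, step)]

bounds = window_bounds(n_rows=8, sequence_length=2)
print(bounds[-1])   # (6, 8): end == n_rows, so y[end - 1] is the last row
print(len(bounds))  # 7, i.e. n_rows - sequence_length + 1
```

Note that with step greater than 1 the last window may still stop short of the final row, depending on whether n_rows - sequence_length is a multiple of step.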