Search code examples
pythonpandastensorflowkeras

Pandas Dataframe to sliding window


I have a pandas DataFrame:

timestamp A B C D
2023-09-27 14:05:50 1 2 3 4
2023-09-27 14:05:51 5 6 7 8
2023-09-27 14:05:52 9 10 11 12
2023-09-27 14:05:53 13 14 15 16
2023-09-27 14:05:54 17 18 19 20
2023-09-27 14:05:55 21 22 23 24

To feed it to a keras autoencoder I need a windowed version of the data (eg. window=3):

timestamp 0 1 2 3 4 5 6 7 8 9 10 11
2023-09-27 14:05:50 1 2 3 4 5 6 7 8 9 10 11 12
2023-09-27 14:05:51 5 6 7 8 9 10 11 12 13 14 15 16
2023-09-27 14:05:52 9 10 11 12 13 14 15 16 17 18 19 20
2023-09-27 14:05:53 13 14 15 16 17 18 19 20 21 22 23 24

I wrote a function but I think I might miss the point. I got problems in the later process and it takes an incredible amount of time (>10 hours, on a machine with 128 cores, a lot of RAM and 32 GPU cards) on my data.

def makeWindowDataFrame(df, windowSize):
    table = []
    for window in df.rolling(window=windowSize):
        if len(window) >= windowSize:
            arr = []
            for el in window.iloc:
                arr.extend(el.to_numpy().reshape(-1))
            table.append(arr)
    longest = len(max(table, key=len))
    return pd.DataFrame(table, columns=[a for a in range(longest)])

Is there a simpler way to create this dataset? This operation is on of the longest running in my setup.

EDIT 1:

def win(df, N):
    return pd.DataFrame(sliding_window_view(df, N, axis=0).swapaxes(1, 2).reshape(len(df)-N+1, -1), index=df.index[:len(df)-N+1])
df = pd.DataFrame( {'timestamp': {28384: pd.Timestamp('2023-09-27 14:05:50'), 28385: pd.Timestamp('2023-09-27 14:05:52'), 28386: pd.Timestamp('2023-09-27 14:05:54'), 28387: pd.Timestamp('2023-09-27 14:05:56'), 28388: pd.Timestamp('2023-09-27 14:05:58')}, 'p1l4e0': {28384: 0.8869906663894653, 28385: 0.9212895035743713, 28386: 0.9084778428077698, 28387: 0.8959079384803772, 28388: 0.9066142439842224}, 'p1l4e1': {28384: 0.3119787573814392, 28385: 0.31039634346961975, 28386: 0.3139703571796417, 28387: 0.3119153082370758, 28388: 0.30586937069892883}, 'p1l4e2': {28384: 0.9320452809333801, 28385: 0.9452565312385559, 28386: 0.9435424208641052, 28387: 0.9356696605682373, 28388: 0.9325512647628784}, 'p1l4e3': {28384: 0.10841193050146103, 28385: 0.1134769469499588, 28386: 0.11245745420455933, 28387: 0.109752357006073, 28388: 0.10924666374921799}} )

win(df, 3)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
28384 2023-09-27 14:05:50 0.886991 0.311979 0.932045 0.108412 2023-09-27 14:05:52 0.92129 0.310396 0.945257 0.113477 2023-09-27 14:05:54 0.908478 0.31397 0.943542
28385 2023-09-27 14:05:52 0.92129 0.310396 0.945257 0.113477 2023-09-27 14:05:54 0.908478 0.31397 0.943542 0.112457 2023-09-27 14:05:56 0.895908 0.311915 0.93567
28386 2023-09-27 14:05:54 0.908478 0.31397 0.943542 0.112457 2023-09-27 14:05:56 0.895908 0.311915 0.93567 0.109752 2023-09-27 14:05:58 0.906614 0.305869 0.932551

EDIT 2:

It seems like the index is not set. This would explain why it is not working.

df = df.set_index('timestamp')
df.head().to_dict('tight')
{
'index': [28384, 28385, 28386, 28387, 28388],
'columns': ['timestamp', 'p1l4e0', 'p1l4e1', 'p1l4e2', 'p1l4e3'],
'data': ...,
'index_names': [None],
'column_names': [None]
}

EDIT 3:

After a restart of the kernel it works. A test showed that the solution gains a speedup of 1000x at minimum and 1k-10k times faster on bigger datasets. Thank you.


Solution

  • Why not use 's sliding_window_view:

    from numpy.lib.stride_tricks import sliding_window_view as svw
    
    N = 3
    out = pd.DataFrame(svw(df, N, axis=0).swapaxes(1, 2).reshape(len(df)-N+1, -1),
                       index=df.index[:len(df)-N+1])
    

    Output:

                         0   1   2   3   4   5   6   7   8   9   10  11
    timestamp                                                          
    2023-09-27 14:05:50   1   2   3   4   5   6   7   8   9  10  11  12
    2023-09-27 14:05:51   5   6   7   8   9  10  11  12  13  14  15  16
    2023-09-27 14:05:52   9  10  11  12  13  14  15  16  17  18  19  20
    2023-09-27 14:05:53  13  14  15  16  17  18  19  20  21  22  23  24
    

    Output with N=2:

                          0   1   2   3   4   5   6   7
    timestamp                                          
    2023-09-27 14:05:50   1   2   3   4   5   6   7   8
    2023-09-27 14:05:51   5   6   7   8   9  10  11  12
    2023-09-27 14:05:52   9  10  11  12  13  14  15  16
    2023-09-27 14:05:53  13  14  15  16  17  18  19  20
    2023-09-27 14:05:54  17  18  19  20  21  22  23  24
    

    timings

    Your code is very slow for large inputs

    on 6 rows, N=3
    # original approach
    925 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    # sliding window view
    69.7 µs ± 266 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    
    on 60k rows, N=3
    # original approach
    6.66 s ± 57.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # sliding window view
    852 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    on 1.2M rows, N=3
    # original approach
    2min 16s ± 685 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # sliding window view
    28.5 ms ± 605 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Handling extra columns:

    N = 3
    cols = ['A', 'B', 'C', 'D']
    
    out = (df[df.columns.difference(cols)].head(-N+1)
           .join(pd.DataFrame(svw(df[cols], N, axis=0)
                              .swapaxes(1, 2)
                              .reshape(len(df)-N+1, -1),
                              index=df.index[:len(df)-N+1])
                 )
           )