
How to create sequences from the start of a dataframe to every index, efficiently?


I have a timeseries dataframe from which I would like to generate sequences.

Time                           A            B           C             D                                                               
2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71      
2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91      
2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41      
2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21     
2019-06-17 08:50:00     12504.11     12504.11    12504.11      12504.11     
2019-06-17 08:51:00     12504.11          NaN    12503.11      12503.11    
2019-06-17 08:52:00     12504.11     12504.11    12503.11      12503.11      
2019-06-17 08:53:00     12503.61     12503.61    12503.61      12503.61      
2019-06-17 08:54:00     12503.61     12503.61    12503.11      12503.11      

The expected outcome is shown below. Each sample starts at the first row and is one row longer than the previous one, ending one row further along. (Note: the indices are shown only for clarity; they do not have to appear in the outcome.)

[                                                                                  
   [2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71  ],  

   [2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91   ],  

   [2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21   ],   

   [2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41   ],  

   [2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41     
   2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21    ], 

   [2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41     
   2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21     
   2019-06-17 08:50:00     12504.11     12504.11    12504.11      12504.11    ], 

   [2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41     
   2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21     
   2019-06-17 08:50:00     12504.11     12504.11    12504.11      12504.11     
   2019-06-17 08:51:00     12504.11          NaN    12503.11      12503.11    ],

   [2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41     
   2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21     
   2019-06-17 08:50:00     12504.11     12504.11    12504.11      12504.11     
   2019-06-17 08:51:00     12504.11          NaN    12503.11      12503.11    
   2019-06-17 08:52:00     12504.11     12504.11    12503.11      12503.11    ],  

   [2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41     
   2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21     
   2019-06-17 08:50:00     12504.11     12504.11    12504.11      12504.11     
   2019-06-17 08:51:00     12504.11          NaN    12503.11      12503.11    
   2019-06-17 08:52:00     12504.11     12504.11    12503.11      12503.11    
   2019-06-17 08:53:00     12503.61     12503.61    12503.61      12503.61     ],   

   [2019-06-17 08:45:00     12089.89     12089.89    12087.71      12087.71     
   2019-06-17 08:46:00     12087.91          NaN    12087.71      12087.91    
   2019-06-17 08:47:00     12088.21     12088.21    12084.21      12085.21      
   2019-06-17 08:48:00     12085.09     12090.21    12084.91      12089.41     
   2019-06-17 08:49:00     12089.71     12090.21    12087.21      12088.21     
   2019-06-17 08:50:00     12504.11     12504.11    12504.11      12504.11     
   2019-06-17 08:51:00     12504.11          NaN    12503.11      12503.11    
   2019-06-17 08:52:00     12504.11     12504.11    12503.11      12503.11    
   2019-06-17 08:53:00     12503.61     12503.61    12503.61      12503.61     
   2019-06-17 08:54:00     12503.61     12503.61    12503.11      12503.11  ]
                                                                               ]

How can I do that efficiently?


Solution

  • I would use np.asarray or DataFrame.values (or the newer DataFrame.to_numpy()); converting to a NumPy array once and slicing that is much faster than slicing the DataFrame itself:

    arr = np.asarray(df)
    #arr = df.values
    result = [arr[:i] for i in range(1, df.shape[0]+1)]
    

    Or, equivalently, with map:

    arr = np.asarray(df)
    #arr = df.values
    result = list(map(lambda i: arr[:i], range(1, df.shape[0]+1)))
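
    A quick sanity check on a small frame (column names and values here are illustrative, not the data from the question) confirms that each element of result is a growing prefix of the array:

```python
import numpy as np
import pandas as pd

# Small illustrative frame (values are arbitrary, NaN included on purpose).
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, np.nan, 6.0]})

arr = np.asarray(df)
result = [arr[:i] for i in range(1, df.shape[0] + 1)]

# The prefixes grow one row at a time: lengths 1, 2, 3.
print([len(r) for r in result])  # [1, 2, 3]
# The last prefix is the whole array, NaN preserved.
print(result[-1].shape)          # (3, 2)
```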
    

    %%timeit
    arr = np.asarray(df)
    result = list(map(lambda i: arr[:i], range(1,df.shape[0]+1)))
    24 µs ± 1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    Without np.asarray (slicing the DataFrame directly), it is roughly 40x slower:

    %%timeit
    result = list(map(lambda i: df[:i], range(1,df.shape[0]+1)))
    935 µs ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    You could also try DataFrame.shift, but it is slow. Note also that dropna() removes any row containing NaN, including the NaN values already present in the data (e.g. the 08:46 row above), so this variant does not produce exactly the same output when the data has missing values:

    %%timeit
    n = df.shape[0]
    [df.shift(i).dropna().values for i in reversed(range(n))]
    15.8 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    EDIT: timing on a larger frame (500,000 rows):

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.rand(500000, 4))
    
    
    %%timeit
    arr = np.asarray(df)
    [arr[:i] for i in range(1, df.shape[0]+1)]
    
    213 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
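
    One reason the NumPy version stays cheap even at this size: basic slicing in NumPy returns views into the original buffer, so none of the row data is copied — only small view objects are created. This can be verified with np.shares_memory (a minimal sketch, using arbitrary random data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 4))
arr = np.asarray(df)
result = [arr[:i] for i in range(1, df.shape[0] + 1)]

# Every prefix is a view sharing memory with arr, not a copy.
print(all(np.shares_memory(arr, r) for r in result))  # True
```

    The flip side is that mutating arr (or the original DataFrame backing it) will be visible in every sequence; call .copy() on each slice if independent arrays are needed.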