I have a timeseries dataframe, from which i would like to generate sequences.
Time A B C D
2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21
2019-06-17 08:50:00 12504.11 12504.11 12504.11 12504.11
2019-06-17 08:51:00 12504.11 NaN 12503.11 12503.11
2019-06-17 08:52:00 12504.11 12504.11 12503.11 12503.11
2019-06-17 08:53:00 12503.61 12503.61 12503.61 12503.61
2019-06-17 08:54:00 12503.61 12503.61 12503.11 12503.11
expected outcome is: (As you can see the samples get longer, always start from the beginning, but end at the next row. Note: I only left the indices inside, to clarify. They should not have to be in the outcome).
[
[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71 ],
[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91 ],
[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21 ],
[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41 ],
[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21 ],
[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21
2019-06-17 08:50:00 12504.11 12504.11 12504.11 12504.11 ],
[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21
2019-06-17 08:50:00 12504.11 12504.11 12504.11 12504.11
2019-06-17 08:51:00 12504.11 NaN 12503.11 12503.11 ],
[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21
2019-06-17 08:50:00 12504.11 12504.11 12504.11 12504.11
2019-06-17 08:51:00 12504.11 NaN 12503.11 12503.11
2019-06-17 08:52:00 12504.11 12504.11 12503.11 12503.11 ],
[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21
2019-06-17 08:50:00 12504.11 12504.11 12504.11 12504.11
2019-06-17 08:51:00 12504.11 NaN 12503.11 12503.11
2019-06-17 08:52:00 12504.11 12504.11 12503.11 12503.11
2019-06-17 08:53:00 12503.61 12503.61 12503.61 12503.61 ],
[2019-06-17 08:45:00 12089.89 12089.89 12087.71 12087.71
2019-06-17 08:46:00 12087.91 NaN 12087.71 12087.91
2019-06-17 08:47:00 12088.21 12088.21 12084.21 12085.21
2019-06-17 08:48:00 12085.09 12090.21 12084.91 12089.41
2019-06-17 08:49:00 12089.71 12090.21 12087.21 12088.21
2019-06-17 08:50:00 12504.11 12504.11 12504.11 12504.11
2019-06-17 08:51:00 12504.11 NaN 12503.11 12503.11
2019-06-17 08:52:00 12504.11 12504.11 12503.11 12503.11
2019-06-17 08:53:00 12503.61 12503.61 12503.61 12503.61
2019-06-17 08:54:00 12503.61 12503.61 12503.11 12503.11 ]
]
How to do that?
I would use np.asarray
or DataFrame.values
, it is faster
arr = np.asarray(df)
#arr = df.values
result = [arr[:i] for i in range(1, df.shape[0]+1)]
Or
arr = np.asarray(df)
#arr = df.values
result = list(map(lambda i: arr[:i], range(1, df.shape[0]+1)))
%%timeit
arr = np.asarray(df)
result = list(map(lambda i: arr[:i], range(1,df.shape[0]+1)))
24 µs ± 1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Without np.asarray
%%timeit
result = list(map(lambda i: df[:i], range(1,df.shape[0]+1)))
935 µs ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You could also try DataFrame.shift
, but it is slow
%%timeit
n = df.shape[0]
[df.shift(i).dropna().values for i in reversed(range(n))]
15.8 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
EDIT
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(500000, 4))
%%timeit
arr = np.asarray(df)
[arr[:i] for i in range(1, df.shape[0]+1)]
213 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)